Insert Spark Dataframe into Hive Table - scala

I have an existing Hive table, stored as ORC, that has the same schema as the DataFrame I create with my Spark job. If I save the DataFrame out as CSV, JSON, text, whatever, it works just fine, and I can hand-migrate those files into a Hive table without any problem.
But when I try to insert directly into Hive with
df.insertInto("table_name", true)
I get this error in the YARN UI:
ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'cast(last_name_2 as array<double>)' due to data type mismatch: cannot cast StringType to ArrayType(DoubleType,true);
I've also tried registering a temp table before calling insert, and also using:
df.write.mode(SaveMode.Append).saveAsTable("default.table_name")
What am I doing wrong?
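One thing to note: insertInto resolves columns by position rather than by name, so if the DataFrame's column order differs from the order in the Hive table's DDL, Spark will attempt casts like the one in the error above. A minimal sketch of aligning the column order first (table name taken from the question, everything else hypothetical):
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Read the target table's column order from the metastore, then select the
// DataFrame's columns in that same order before inserting.
val tableCols = sqlContext.table("table_name").columns
val aligned = df.select(tableCols.map(col): _*)
aligned.write.mode(SaveMode.Append).insertInto("table_name")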

Related

How to fix incompatible Parquet schema error in Impala

Experts, I am querying a Hive external table (data stored as Parquet on HDFS) without any issue using Hive/PySpark. The same query, when run on Impala, gives me an "Incompatible Parquet Schema" error. The field is defined as 'String' in the Hive DDL and the only value it has is '000'.
Error:
File 'hdfs://home/user1/out/STATEMENTS/effective_dt=20200804/part-00007-79ea7da3-2165-4a92-9c30-a11e415e778c.c000' has an incompatible Parquet schema for column 'db.statements.profit'. Column type: CHAR(3), Parquet schema: optional int32 os_chargeoff [i:144 d:1 r:0]
Does anyone know how to fix this?
I tried different things (REFRESH table, INVALIDATE METADATA, dropping and recreating the table) without success.
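One way to narrow this down (a sketch, assuming Spark is available; the path below is a placeholder for the real partition directory) is to read the Parquet files directly and compare their physical schema with the table DDL:
// Print the schema the Parquet files actually carry; the error suggests the
// physical column is an int32, while the table declares it as CHAR(3)/STRING.
val parquetDf = spark.read.parquet("/path/to/STATEMENTS/effective_dt=20200804")
parquetDf.printSchema()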

Not able to insert data into hive elasticsearch index using spark SQL

I have used the following steps in hive terminal to insert into elasticsearch index -
Create hive table pointing to elasticsearch index
CREATE EXTERNAL TABLE test_es(
id string,
name string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'test/person', 'es.mapping.id' = 'id');
Create a staging table and insert data into it
Create table emp(id string,name string) row format delimited fields terminated by ',';
load data local inpath '/home/monami/data.txt' into table emp;
Insert data from staging table into the hive elasticsearch index
insert overwrite table test_es select * from emp;
I could browse through the Hive Elasticsearch index successfully following the above steps in the Hive CLI. But whenever I try to insert in the same way using the Spark SQL hiveContext object, I get the following error -
java.lang.RuntimeException: java.lang.RuntimeException: class org.elasticsearch.hadoop.mr.EsOutputFormat$EsOutputCommitter not org.apache.hadoop.mapred.OutputCommitter
Can you please let me know the reason for this error? If it is not possible to insert the same way using Spark, then what is the method to insert into a Hive Elasticsearch index using Spark?
Versions used - Spark 1.6, Scala 2.10, Elasticsearch 6.4, Hive 1.1
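One possible workaround on the Spark side (a sketch, assuming the elasticsearch-spark connector built for Spark 1.6 / Scala 2.10 is on the classpath; the es.nodes value is a placeholder) is to skip the Hive storage handler and write to the index directly:
import org.elasticsearch.spark.sql._

// Read the staging table through the HiveContext and push it straight to the
// test/person index, using the id column as the document id.
val empDf = hiveContext.table("emp")
empDf.saveToEs("test/person", Map("es.nodes" -> "localhost", "es.mapping.id" -> "id"))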

PySpark - Saving Hive Table - org.apache.spark.SparkException: Cannot recognize hive type string

I am saving a Spark dataframe to a Hive table. The Spark dataframe is a nested JSON data structure. I am able to save the dataframe as files, but it fails at the point where it creates a Hive table on top of them, saying:
org.apache.spark.SparkException: Cannot recognize hive type string
I cannot create a Hive table schema first and then insert into it, since the dataframe consists of a couple hundred nested columns.
So I am saving it as:
df.write.partitionBy("dt","file_dt").saveAsTable("df")
I am not able to figure out what the issue is.
The issue I was having turned out to be a few columns that were named with bare numbers: "1", "2", "3". Removing those columns from the dataframe let me create the Hive table without any errors.
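If dropping those columns is not an option, a rough alternative (sketched in Scala here, though the same idea applies in PySpark, and it only touches top-level column names) is to rename the purely numeric columns before calling saveAsTable:
// Prefix column names that are entirely digits, e.g. "1" -> "col_1", so the
// generated Hive schema can be parsed by the metastore.
val renamed = df.columns.foldLeft(df) { (d, c) =>
  if (c.nonEmpty && c.forall(_.isDigit)) d.withColumnRenamed(c, s"col_$c") else d
}
renamed.write.partitionBy("dt", "file_dt").saveAsTable("df")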

pySpark jdbc write error: An error occurred while calling o43.jdbc. : scala.MatchError: null

I am trying to write a simple Spark dataframe to a DB2 database using PySpark. The dataframe has only one row and one column, with double as the data type.
When I try to write this dataframe to db2 table with this syntax:
dataframe.write.mode('overwrite').jdbc(url=url, table=source, properties=prop)
it creates the table in the database the first time without any issue, but if I run the code a second time, it throws the scala.MatchError: null exception.
On the DB2 side the column datatype is also DOUBLE.
Not sure what I am missing.
I just changed a small part of the code and it worked perfectly.
Here is the small change I made to the syntax:
dataframe.write.jdbc(url=url, table=source, mode = 'overwrite', properties=prop)

Unable to save dataframe as hive table from spark which is throwing serde exception

I have loaded one of my tables into a dataframe and am trying to save it as a Hive table.
var RddTableName = objHiveContext.sql("select * from tableName")
val dataframeTable = RddTableName.toDF()
dataframeTable.write.format("orc").mode(SaveMode.Overwrite).saveAsTable("test.myTable")
I'm getting the below exception:
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: cannot find field mytable from [public java.util.ArrayList org.apache.hadoop.hive.serde2.ColumnSet.col]
The above exception occurs because I was trying to overwrite, which looks for the existing table "myTable" (and it is not there), so to create a new table we have to go with SaveMode.Ignore or SaveMode.ErrorIfExists. You can mention the database name in the options by mapping its path.
First set:
hcontext.sql("use database")
You cannot put the database name in saveAsTable.
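Putting that together, a minimal sketch of the corrected flow (database and table names as in the question):
import org.apache.spark.sql.SaveMode

objHiveContext.sql("use test")                       // select the database first
val dataframeTable = objHiveContext.sql("select * from tableName")
dataframeTable.write
  .format("orc")
  .mode(SaveMode.ErrorIfExists)                      // or SaveMode.Ignore, since the table does not exist yet
  .saveAsTable("myTable")                            // no database prefix here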