Experts, I am querying a Hive external table (data stored as Parquet on HDFS) without any issue using Hive/PySpark. The same query, when run on Impala, gives me an "Incompatible Parquet Schema" error. The field is defined as 'String' in the Hive DDL and the only value it holds is '000'.
Error:
File 'hdfs://home/user1/out/STATEMENTS/effective_dt=20200804/part-00007-79ea7da3-2165-4a92-9c30-a11e415e778c.c000' has an incompatible Parquet schema for column 'db.statements.profit'. Column type: CHAR(3), Parquet schema: optional int32 os_chargeoff [i:144 d:1 r:0]
Does anyone know how to fix this?
I have tried different things (REFRESH table, INVALIDATE METADATA, dropping and recreating the table) without success.
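In case it helps anyone spot the mismatch, here is a minimal PySpark sketch for comparing what the table declares with what the Parquet files in the problem partition actually contain (the table name and path are taken from the error above and may need adjusting):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Schema as declared on the Hive table.
spark.table("db.statements").printSchema()

# Schema actually written into the Parquet files of the failing partition.
spark.read.parquet("hdfs://home/user1/out/STATEMENTS/effective_dt=20200804").printSchema()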
Related
I have converted data from CSV to Parquet format using PySpark with schema inference and tried to read the data using Athena.
df.printSchema()
test_num : double (nullable = true)
Athena also uses the double data type when we create the table with a Glue crawler, but we can't query the table because of the issue below.
Error:
test_num : type INT64 in parquet is incompatible with type double defined in table schema
Any suggestions to resolve this issue? I appreciate your help.
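For context, the conversion looks roughly like the sketch below (bucket paths are placeholders). Casting test_num explicitly should force every Parquet file to carry DOUBLE, instead of letting schema inference pick INT64 for files whose sampled values happen to be whole numbers:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Read the CSV with inferred types, then pin test_num to double before writing Parquet.
df = spark.read.option("header", "true").option("inferSchema", "true").csv("s3://my-bucket/input/")
df = df.withColumn("test_num", col("test_num").cast("double"))
df.write.mode("overwrite").parquet("s3://my-bucket/output/")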
I am trying to copy an AVRO file that is stored in S3 to a table I created in Redshift and I am getting all null values. However, the AVRO file does not have null values in it. I see the following error when I look at the log: "Missing newline: Unexpected character 0x79 found at location 9415"
I did some research online and the only post I could find said that values would be null if the column name case in the target table did not match the source file. I have ensured the case for the column in the target table is the same as the source file.
Here is a mock snippet from the AVRO file:
Objavro.schemaĒ{"type":"record","name":"something","fields":[{"name":"g","type":["string","null"]},{"name":"stuff","type":["string","null"]},{"name":"stuff","type":["string","null"]}
Here is the sql code I am using in Redshift:
create table schema.table_name (g varchar(max));
copy schema.table_name
from 's3://bucket/folder/file.avro'
iam_role 'arn:aws:iam::xxxxxxxxx:role/xx-redshift-readonly'
format as avro 'auto';
I am expecting to see a table with one column called g, where each row has the corresponding value from the file.
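To rule out a problem with the file itself, this is a small sketch (assuming the file has been downloaded locally and the fastavro package is installed) for checking that the file is readable Avro and that the field names and their case match the Redshift column:
from fastavro import reader

# Local copy of s3://bucket/folder/file.avro
with open("file.avro", "rb") as fo:
    avro_reader = reader(fo)
    first = next(avro_reader)
    print(list(first.keys()))  # field names exactly as COPY ... format as avro 'auto' will try to match them
    print(first["g"])          # value of the column the Redshift table expects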
Saving a DataFrame to a table with VARBINARY columns throws this error:
com.microsoft.sqlserver.jdbc.SQLServerException: Column, parameter, or
variable #7: Cannot find data type BLOB
If I try to use VARBINARY in the createTableColumnTypes option, I get "VARBINARY not supported".
The workaround is:
Change the target table schema to use VARCHAR.
Add .option("createTableColumnTypes", "Col1 varchar(500), Col2 varchar(500)")
While this workaround lets us go ahead with saving the rest of the data, the actual binary data from the source table (from which the data is read) is not saved correctly for these two columns; we see NULL data.
We are using the MS SQL Server 2017 JDBC driver and Spark 2.3.2.
Any help or workaround to address this issue correctly, so that we don't lose data, is appreciated.
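One direction we are considering (a sketch only, not something we have verified against SQL Server 2017): pre-create the target table with VARBINARY(MAX) columns outside of Spark, then have Spark append into the existing table so it never generates the DDL itself and createTableColumnTypes is not needed. Connection details and table names below are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

source_url = "jdbc:sqlserver://sourcehost:1433;databaseName=sourcedb"
target_url = "jdbc:sqlserver://targethost:1433;databaseName=targetdb"

# Read the source table, including the binary columns.
df = (spark.read
      .format("jdbc")
      .option("url", source_url)
      .option("dbtable", "dbo.source_table")
      .option("user", "user")
      .option("password", "password")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())

# Append into a table that was created beforehand with VARBINARY(MAX) columns,
# so Spark never has to map BinaryType to a CREATE TABLE data type.
(df.write
   .format("jdbc")
   .option("url", target_url)
   .option("dbtable", "dbo.target_table")
   .option("user", "user")
   .option("password", "password")
   .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
   .mode("append")
   .save())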
I am saving a Spark DataFrame to a Hive table. The DataFrame holds a nested JSON data structure. I am able to save the DataFrame as files, but it fails at the point where it creates a Hive table on top of them, saying:
org.apache.spark.SparkException: Cannot recognize hive type string
I cannot create a Hive table schema first and then insert into it, since the DataFrame consists of a couple hundred nested columns.
So I am saving it as:
df.write.partitionBy("dt","file_dt").saveAsTable("df")
I am not able to work out what the issue is.
The issue I was having was down to a few columns that were named as numbers: "1", "2", "3". Removing those columns from the DataFrame let me create a Hive table without any errors.
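In case someone needs to keep those columns rather than drop them, here is a sketch of renaming the numeric top-level columns instead (the "col_" prefix is an arbitrary choice to make the names Hive-friendly; numeric field names nested inside structs would still need separate handling):
# df is the nested DataFrame from the question.
renamed = df
for c in df.columns:
    if c.isdigit():
        renamed = renamed.withColumnRenamed(c, "col_" + c)

renamed.write.partitionBy("dt", "file_dt").saveAsTable("df")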
I have an existing Hive table, stored as ORC, that has the same schema as the DataFrame I create with my Spark job. If I save the DataFrame down as CSV, JSON, text, whatever, it works just fine, and I can hand-migrate these files into a Hive table just fine.
But when I try to insert directly into Hive with
df.insertInto("table_name", true)
I get this error in the YARN UI:
ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'cast(last_name_2 as array<double>)' due to data type mismatch: cannot cast StringType to ArrayType(DoubleType,true);
I've also tried registering a temp table before calling insert, and also using:
df.write.mode(SaveMode.Append).saveAsTable("default.table_name")
What am I doing wrong?
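One thing I still plan to try, based on my understanding that insertInto resolves columns by position rather than by name, is reordering the DataFrame columns to match the existing table before inserting. A PySpark sketch (the snippets above look like Scala, so it would need translating):
# Reorder df's columns to the exact order of the target table, then insert.
table_cols = spark.table("default.table_name").columns
df.select(*table_cols).write.insertInto("default.table_name", overwrite=True)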