Spark version: 2.4.2 on Amazon EMR 5.24.0
I have a Glue Catalog table backed by an S3 Parquet directory. The Parquet files have case-sensitive column names (like lastModified). No matter what I do, I get lowercase column names (lastmodified) when reading the Glue Catalog table with Spark:
for {
  i <- Seq(false, true)
  j <- Seq("NEVER_INFER", "INFER_AND_SAVE", "INFER_ONLY")
  k <- Seq(false, true)
} {
  val spark = SparkSession.builder()
    .config("spark.sql.hive.convertMetastoreParquet", i)
    .config("spark.sql.hive.caseSensitiveInferenceMode", j)
    .config("spark.sql.parquet.mergeSchema", k)
    .enableHiveSupport()
    .getOrCreate()
  import spark.sql
  val df = sql("""SELECT * FROM ecs_db.test_small""")
  df.columns.foreach(println)
}
[1] https://medium.com/@an_chee/why-using-mixed-case-field-names-in-hive-spark-sql-is-a-bad-idea-95da8b6ec1e0
[2] https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
Edit
The below solution is incorrect.
Glue Crawlers are not supposed to set the spark.sql.sources.schema.* properties; Spark SQL should. The default for spark.sql.hive.caseSensitiveInferenceMode in Spark 2.4 is INFER_AND_SAVE, which means Spark infers the schema from the underlying files and alters the table to add the spark.sql.sources.schema.* properties to SERDEPROPERTIES. In our case, Spark failed to do so because of an IllegalArgumentException: Can not create a Path from an empty string. This is thrown because the Hive database class instance has an empty locationUri property string, which in turn happens because the Glue database has no Location property. Once the schema is saved, Spark reads it from the table.
There could be a way around this by setting INFER_ONLY, which should only infer the schema from the files and not attempt to alter the table's SERDEPROPERTIES. However, this doesn't work because of a Spark bug, where the inferred schema is then lowercased (see here).
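One workaround that sidesteps the metastore entirely is to read the Parquet directory directly: the schema then comes from the Parquet file footers, which preserve the original casing, rather than from the catalog. A minimal sketch (the S3 path is an assumption, not the actual table location):

```scala
// Sketch: bypass the Glue Catalog and read the Parquet directory directly.
// The schema is taken from the Parquet file footers, which keep the
// original case-sensitive names (e.g. lastModified).
// The S3 path below is hypothetical - substitute the table's real location.
val df = spark.read.parquet("s3://my-bucket/path/to/test_small/")
df.columns.foreach(println)
```

This of course loses the catalog integration (partitions registered in Glue, table-level permissions), so it is a stopgap rather than a fix.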
Original solution (incorrect)
This bug happens because the Glue table's SERDEPROPERTIES is missing two important properties:
spark.sql.sources.schema.numParts
spark.sql.sources.schema.part.0
To solve the problem, I had to add those two properties via the Glue console (couldn't do it with ALTER TABLE …)
I guess this is a bug with Glue crawlers, which do not set these properties when creating the table.
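For context, spark.sql.sources.schema.part.0 holds the case-preserving schema as a JSON-serialized StructType, and spark.sql.sources.schema.numParts records how many chunks the JSON was split into when it exceeds the metastore's property-length limit. A minimal sketch of what Spark stores, using an illustrative single-column schema (the real table has more fields):

```scala
import org.apache.spark.sql.types._

// Illustrative schema with one mixed-case column.
val schema = StructType(Seq(StructField("lastModified", TimestampType)))

// This JSON string is what ends up in spark.sql.sources.schema.part.0
// (or split across part.0, part.1, ... for wide schemas).
println(schema.json)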
Related
I tried to create a Delta table from a Spark DataFrame using the command below:
destination_path = "/dbfs/mnt/kidneycaredevstore/delta/df_corr_feats_spark_4"
df_corr_feats_spark.write.format("delta")
  .option("delta.columnMapping.mode", "name")
  .option("path", destination_path)
  .saveAsTable("CKD_Features_4")
Getting below error:
AnalysisException: Cannot create a table having a column whose name contains commas in Hive metastore. Table: default.abc_features_4; Column: Adverse, abc initial encounter
Please note that this DataFrame has around 6,000 columns generated by a data scientist's feature-engineering process, so we cannot rename the columns.
How can I fix this error? Can a configuration change in the Metastore solve this issue?
The column mapping feature requires writer version 5 and reader version 2 of the Delta protocol, so you need to specify these when saving:
df_corr_feats_spark.write.format("delta")
.option("delta.columnMapping.mode", "name")
.option("delta.minReaderVersion", "2")
.option("delta.minWriterVersion", "5")
.option("path", destination_path)
.saveAsTable("CKD_Features_4")
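To confirm the table actually picked up the required protocol versions, one option is Delta's DESCRIBE DETAIL command, which reports them per table. A sketch, reusing the table name from the question:

```scala
// Sketch: DESCRIBE DETAIL reports the table's Delta protocol versions,
// which should show minReaderVersion = 2 and minWriterVersion = 5
// after the write above succeeds.
spark.sql("DESCRIBE DETAIL CKD_Features_4")
  .select("minReaderVersion", "minWriterVersion")
  .show()
```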
I am using the line below to create a DataFrame (Spark Scala) from a Hive external table. But the DataFrame also gets loaded with data; I need an empty DataFrame created with only the external table's schema.
val table1 = sqlContext.table("db.table")
How can I create an empty DataFrame using the Hive external table's schema?
You can just do:
val table1 = sqlContext.table("db.table").limit(0)
This gives you the empty DataFrame with only the schema. Thanks to lazy evaluation, it also takes no longer than just loading the schema.
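An equivalent alternative, if you prefer to be explicit about building from the schema alone, is to pair the table's schema with an empty RDD. A sketch using the same table name:

```scala
import org.apache.spark.sql.Row

// Grab only the schema from the Hive table, then build an empty
// DataFrame from an empty RDD with that schema attached.
val schema = sqlContext.table("db.table").schema
val emptyDf = sqlContext.createDataFrame(sc.emptyRDD[Row], schema)
```

Both approaches produce a zero-row DataFrame with the table's schema; limit(0) is simply shorter.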
I am saving a Spark DataFrame to a Hive table. The DataFrame has a nested JSON structure. I am able to save the DataFrame as files, but it fails at the point where it creates a Hive table on top of them, saying:
org.apache.spark.SparkException: Cannot recognize hive type string
I cannot create the Hive table schema first and then insert into it, since the DataFrame consists of a couple hundred nested columns.
So I am saving it as:
df.write.partitionBy("dt","file_dt").saveAsTable("df")
I am not able to work out what the issue is.
The issue I was having was due to a few columns that were named as numbers ("1", "2", "3"). Removing such columns from the DataFrame let me create the Hive table without any errors.
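If dropping those columns is not acceptable, an alternative (a sketch; the col_ prefix is an arbitrary choice) is to rename every purely numeric column name before saving:

```scala
// Rename every column whose name is purely numeric ("1" -> "col_1", ...)
// so Hive can accept the identifiers, then save as before.
val cleaned = df.columns.foldLeft(df) { (acc, name) =>
  if (name.nonEmpty && name.forall(_.isDigit))
    acc.withColumnRenamed(name, s"col_$name")
  else acc
}
cleaned.write.partitionBy("dt", "file_dt").saveAsTable("df")
```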
The code below is how it was written into HDFS using scala. What is the HQL syntax to create a Hive table to query this data?
import com.databricks.spark.avro._
val path = "/user/myself/avrodata"
dataFrame.write.avro(path)
The examples I find require providing an avro.schema.literal to describe the schema or an avro.schema.url to the actual avro schema.
In the spark-shell all I would need to do to read this is:
scala> import com.databricks.spark.avro._
scala> val df = sqlContext.read.avro("/user/myself/avrodata")
scala> df.show()
So I cheated to get this to work. Basically, I created a temporary table and used HQL to create the target table and insert the data from the temp table. This method uses the temporary table's metadata to create and populate the Avro target table I wanted. If the DataFrame can create a temporary table from its schema, why can it not save the table as Avro?
dataFrame.registerTempTable("my_tmp_table")
sqlContext.sql(s"CREATE TABLE ${schema}.${tableName} STORED AS AVRO AS SELECT * FROM my_tmp_table")
I am trying to create a Hive external table from a Spark application, passing the location as a variable to the SQL command. It doesn't create the Hive table, and I don't see any errors.
val location = "/home/data"
hiveContext.sql(s"""CREATE EXTERNAL TABLE IF NOT EXISTS TestTable(id STRING,name STRING) PARTITIONED BY (city string) STORED AS PARQUET LOCATION '${location}' """)
Spark only supports creating managed tables. And even then there are severe restrictions: it does not support dynamically partitioned tables.
TL;DR: you cannot create external tables through Spark. Spark can only read them.
Not sure which version had this limitation.
I am using Spark 1.6, Hive 1.1, and I am able to create the external table. Please see below:
var query = """CREATE EXTERNAL TABLE avro_hive_table
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION '/user/avro/applog_avro'
  TBLPROPERTIES ('avro.schema.url'='hdfs://localdomain/user/avro/schemas/activity.avsc')"""
var hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql(query)
var df = hiveContext.sql("select count(*) from avro_hive_table")