I am using the code below to create a DataFrame (Spark Scala) from a Hive external table, but the DataFrame also gets loaded with data. I need an empty DataFrame created using the Hive external table's schema. I am using Spark Scala for this.
val table1 = sqlContext.table("db.table")
How can I create an empty dataframe using the hive external table's schema?
You can just do:
val table1 = sqlContext.table("db.table").limit(0)
This will give you an empty DataFrame with only the schema. Because of lazy evaluation it also does not take any longer than just loading the schema.
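If you prefer to build the empty DataFrame explicitly from the table's schema, an equivalent sketch (assuming the same sqlContext and a SparkContext named sc) is:

import org.apache.spark.sql.Row

// Pair the Hive table's schema with an empty RDD of rows
val schema = sqlContext.table("db.table").schema
val emptyDF = sqlContext.createDataFrame(sc.emptyRDD[Row], schema)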
Related
I need a way to create a hive table from a Scala dataframe. The hive table should have its underlying files in ORC format in an S3 location, partitioned by date.
Here is what I have got so far:
I write the scala dataframe to S3 in ORC format
df.write.format("orc").partitionBy("date").save("S3Location")
I can see the ORC files in the S3 location.
I then create a hive table on top of these ORC files:
CREATE EXTERNAL TABLE "tableName"(columnName string)
PARTITIONED BY (date string)
STORED AS ORC
LOCATION "S3Location"
TBLPROPERTIES("orc.compress"="SNAPPY")
But the hive table is empty, i.e.
spark.sql("select * from db.tableName") prints no results.
However, when I remove the PARTITIONED BY line:
CREATE EXTERNAL TABLE "tableName"(columnName string, date string)
STORED AS ORC
LOCATION "S3Location"
TBLPROPERTIES("orc.compress"="SNAPPY")
I see results from the select query.
It seems that hive does not recognize the partitions created by spark. I am using Spark 2.2.0.
Any suggestions will be appreciated.
Update:
I am starting with a spark dataframe and I just need a way to create a hive table on top of it (with the underlying files in ORC format in an S3 location).
I think the partitions have not been added yet to the hive metastore, so you only need to run this hive command:
MSCK REPAIR TABLE table_name
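If you are running everything from Spark, the same command can be issued through spark.sql (assuming the db.tableName used in the question):

spark.sql("MSCK REPAIR TABLE db.tableName")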
If that does not work, maybe you need to check these points:
After writing data into S3, the folder layout should look like: s3://anypathyouwant/mytablefolder/transaction_date=2020-10-30
When creating the external table, the location should point to s3://anypathyouwant/mytablefolder
And yes, Spark writes data into S3 but does not add the partition definitions to the hive metastore! Hive is not aware of the written data unless it sits under a recognized partition.
So to check what partitions are in the hive metastore, you can use this hive command :
SHOW PARTITIONS tablename
In a production environment, I do not recommend using MSCK REPAIR TABLE for this purpose because it gets more and more time consuming as partitions accumulate. The best way is to make your code add only the newly created partitions to your metastore, for example through the metastore's REST API.
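A rough sketch of that idea from Spark itself (the table name, partition value, and S3 path below are illustrative, based on the folder layout above):

// Register only the newly written partition with the metastore
spark.sql("""
  ALTER TABLE db.tableName
  ADD IF NOT EXISTS PARTITION (date='2020-10-30')
  LOCATION 's3://anypathyouwant/mytablefolder/date=2020-10-30'
""")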
Spark version: 2.4.2 on Amazon EMR 5.24.0
I have a Glue Catalog table backed by an S3 Parquet directory. The Parquet files have case-sensitive column names (like lastModified). No matter what I do, I get lowercase column names (lastmodified) when reading the Glue Catalog table with Spark:
for {
  i <- Seq(false, true)
  j <- Seq("NEVER_INFER", "INFER_AND_SAVE", "INFER_ONLY")
  k <- Seq(false, true)
} {
  val spark = SparkSession.builder()
    .config("spark.sql.hive.convertMetastoreParquet", i)
    .config("spark.sql.hive.caseSensitiveInferenceMode", j)
    .config("spark.sql.parquet.mergeSchema", k)
    .enableHiveSupport()
    .getOrCreate()
  import spark.sql

  val df = sql("""SELECT * FROM ecs_db.test_small""")
  df.columns.foreach(println)
}
[1] https://medium.com/@an_chee/why-using-mixed-case-field-names-in-hive-spark-sql-is-a-bad-idea-95da8b6ec1e0
[2] https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
Edit
The below solution is incorrect.
Glue Crawlers are not supposed to set the spark.sql.sources.schema.* properties; Spark SQL should. The default for spark.sql.hive.caseSensitiveInferenceMode in Spark 2.4 is INFER_AND_SAVE, which means that Spark infers the schema from the underlying files and alters the table to add the spark.sql.sources.schema.* properties to SERDEPROPERTIES. In our case, Spark failed to do so because of an IllegalArgumentException: Can not create a Path from an empty string, which is thrown because the Hive Database class instance has an empty locationUri property string. This, in turn, is because the Glue database does not have a Location property. After the schema is saved, Spark reads it from the table.
There could be a way around this by setting INFER_ONLY, which should only infer the schema from the files and not attempt to alter the table's SERDEPROPERTIES. However, this doesn't work because of a Spark bug, where the inferred schema is then lowercased (see here).
Original solution (incorrect)
This bug happens because the Glue table's SERDEPROPERTIES is missing two important properties:
spark.sql.sources.schema.numParts
spark.sql.sources.schema.part.0
To solve the problem, I had to add those two properties via the Glue console (couldn't do it with ALTER TABLE …)
I guess this is a bug with Glue crawlers, which do not set these properties when creating the table.
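For context, here is a rough sketch of what Spark typically stores in these properties: the DataFrame schema serialized as JSON, split into numbered parts. The S3 path and the 4000-character chunk size below are assumptions for illustration only.

// Illustration only: read the Parquet files directly to get the case-sensitive
// schema, then serialize it the way Spark does for these table properties.
val schemaJson = spark.read.parquet("s3://my-bucket/path/to/parquet").schema.json
val parts = schemaJson.grouped(4000).toList // assumed chunk size
// spark.sql.sources.schema.numParts -> parts.length.toString
// spark.sql.sources.schema.part.0   -> parts(0), part.1 -> parts(1), ...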
I am saving a spark dataframe to a hive table. The dataframe is a nested JSON data structure. I am able to save the dataframe as files, but it fails at the point where it creates a hive table on top of them, saying
org.apache.spark.SparkException: Cannot recognize hive type string
I cannot create a hive table schema first and then insert into it, since the dataframe consists of a couple hundred nested columns.
So I am saving it as:
df.write.partitionBy("dt","file_dt").saveAsTable("df")
I am not able to debug what the issue is.
The issue I was having was due to a few columns which were named as numbers: "1", "2", "3". Removing those columns from the dataframe let me create a hive table without any errors.
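If dropping those columns is not an option, a small sketch of the same idea is to rename purely numeric column names before saving; the col_ prefix here is just an assumption:

// Rename columns whose names are purely numeric (e.g. "1", "2", "3") so the
// generated Hive DDL only contains valid identifiers, then save as before.
val sanitized = df.columns.foldLeft(df) { (acc, name) =>
  if (name.nonEmpty && name.forall(_.isDigit)) acc.withColumnRenamed(name, s"col_$name")
  else acc
}
sanitized.write.partitionBy("dt", "file_dt").saveAsTable("df")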
The code below is how the data was written into HDFS using Scala. What is the HQL syntax to create a Hive table to query this data?
import com.databricks.spark.avro._
val path = "/user/myself/avrodata"
dataFrame.write.avro(path)
The examples I find require providing an avro.schema.literal to describe the schema or an avro.schema.url pointing to the actual avro schema.
In the spark-shell all I would need to do to read this is:
scala> import com.databricks.spark.avro._
scala> val df = sqlContext.read.avro("/user/myself/avrodata")
scala> df.show()
So I cheated to get this to work. Basically I created a temporary table and used HQL to create and insert the data from the temp table. This method uses the metadata from the temporary table to create and populate the avro target table I wanted. If the data frame can create a temporary table from its schema, why could it not save the table as avro?
dataFrame.registerTempTable("my_tmp_table")
// reference the registered temp table name in the CTAS statement
sqlContext.sql(s"create table ${schema}.${tableName} stored as avro as select * from my_tmp_table")
I am trying to create a Hive external table from a Spark application, passing the location as a variable to the SQL command. It doesn't create the Hive table and I don't see any errors.
val location = "/home/data"
hiveContext.sql(s"""CREATE EXTERNAL TABLE IF NOT EXISTS TestTable(id STRING,name STRING) PARTITIONED BY (city string) STORED AS PARQUET LOCATION '${location}' """)
Spark only supports creating managed tables, and even then there are severe restrictions: it does not support dynamically partitioned tables.
TL;DR: you cannot create external tables through Spark. Spark can read them.
Not sure which version had this limitation.
I am using Spark 1.6, Hive 1.1.
I am able to create the external table; please follow the example below:
val query =
  """CREATE EXTERNAL TABLE avro_hive_table
    |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    |STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    |OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    |LOCATION '/user/avro/applog_avro'
    |TBLPROPERTIES ('avro.schema.url'='hdfs://localdomain/user/avro/schemas/activity.avsc')""".stripMargin

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql(query)
val df = hiveContext.sql("select count(*) from avro_hive_table")
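A quick sanity check (just a usage sketch) that the table was registered and is readable:

df.show()
hiveContext.sql("DESCRIBE FORMATTED avro_hive_table").show(100, false)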