Creating table from parquet files - scala

I am facing an issue when querying data I have inserted.
I read some CSV files into a DataFrame and store the DataFrame on HDFS like this:
val data = spark.read.option("header", "true").option("delimiter", ",").csv("/path_to_csv//*.csv")
data.repartition($"year", $"month", $"day").write.partitionBy("year", "month", "day").mode("overwrite").option("header", "true").option("delimiter", ",").parquet ("/path/to/parquet")
Then I created an external table on top of the stored parquet files like this:
create external table tab (col1 string, col2 string, col3 int)
partitioned by (year int,month int,day int) stored as parquet
LOCATION 'hdfs://path/to/parquet'
Up to here everything is OK, but when I query the table:
select * from tab
I get no results.
Has anybody faced this issue?
Thanks.
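One likely cause, offered as an assumption since the thread does not confirm it: Hive only reads the partitions of an external table that are registered in the metastore, and partition directories written directly by Spark are not registered automatically. Running MSCK REPAIR TABLE tab is the usual one-line remedy. As an alternative, the sketch below (plain Python; the function name and the sample partition directory are hypothetical) builds explicit ALTER TABLE ... ADD PARTITION statements from the directory names that partitionBy produces.

```python
import re


def add_partition_statements(table, base_location, partition_dirs):
    """Build one ALTER TABLE ... ADD PARTITION statement per partition directory."""
    statements = []
    for rel_dir in partition_dirs:
        # "year=2020/month=1/day=5" -> [("year", "2020"), ("month", "1"), ("day", "5")]
        pairs = re.findall(r"([^/=]+)=([^/]+)", rel_dir)
        spec = ", ".join(f"{key}={value}" for key, value in pairs)
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec}) "
            f"LOCATION '{base_location.rstrip('/')}/{rel_dir}'"
        )
    return statements


# Hypothetical partition directory, relative to the table LOCATION.
for statement in add_partition_statements(
    "tab", "hdfs://path/to/parquet", ["year=2020/month=1/day=5"]
):
    print(statement)
```

The values are emitted unquoted because the partition columns in the table above are declared as int; string partition columns would need quoted values.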

Related

pyspark + hive: difference between first row in dataframe and table

I created a table in Hive using a csv file containing a header:
CREATE TABLE resultado(
data_jogo date,
mandante string,
visitante string,
gols_mandante int,
gols_visitante int,
torneio string,
cidade string,
pais string,
campo_neutro boolean
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
TBLPROPERTIES('skip.header.line.count'='1');
LOAD DATA INPATH '/user/hive/projeto/results.csv' OVERWRITE INTO TABLE resultado;
It works fine when I try a select:
SELECT * FROM resultado LIMIT 5;
Then I went to pyspark to see the same data:
from pyspark.sql import HiveContext
h = HiveContext(sc)
df = h.table('resultado')
df.show(5)
But it returns a DataFrame whose first row is the header from the file loaded into the table.
Can someone please tell me what I'm doing wrong? As you can see, I'm really new to this.

How to add a new column to a Delta Lake table?

I'm trying to add a new column to data stored as a Delta Table in Azure Blob Storage. Most of the actions being done on the data are upserts, with many updates and few new inserts. My code to write data currently looks like this:
DeltaTable.forPath(spark, deltaPath)
.as("dest_table")
.merge(myDF.as("source_table"),
"dest_table.id = source_table.id")
.whenNotMatched()
.insertAll()
.whenMatched(upsertCond)
.updateExpr(upsertStat)
.execute()
From these docs, it looks like Delta Lake supports adding new columns on insertAll() and updateAll() calls only. However, I'm updating only when certain conditions are met and want the new column added to all the existing data (with a default value of null).
I've come up with a solution that seems extremely clunky and am wondering if there's a more elegant approach. Here's my current proposed solution:
// Read in existing data
val myData = spark.read.format("delta").load(deltaPath)
// Register table with Hive metastore
myData.write.format("delta").saveAsTable("input_data")
// Add new column
spark.sql("ALTER TABLE input_data ADD COLUMNS (new_col string)")
// Save as DataFrame and overwrite data on disk
val sqlDF = spark.sql("SELECT * FROM input_data")
sqlDF.write.format("delta").option("mergeSchema", "true").mode("overwrite").save(deltaPath)
Alter your Delta table first and then do your merge operation:
from pyspark.sql.functions import lit
spark.read.format("delta").load('/mnt/delta/cov')\
.withColumn("Recovered", lit(''))\
.write\
.format("delta")\
.mode("overwrite")\
.option("overwriteSchema", "true")\
.save('/mnt/delta/cov')
New columns can also be added with SQL commands as follows:
ALTER TABLE dbName.TableName ADD COLUMNS (newColumnName dataType)
UPDATE dbName.TableName SET newColumnName = val;
This is the approach that worked for me using Scala.
I have a Delta table, named original_table, whose path is:
val path_to_delta = "/mnt/my/path"
This table currently has 1M records with the following schema: pk, field1, field2, field3, field4
I want to add a new field, named newfield, to the existing schema without losing the data already stored in original_table.
So I first created a simple schema containing just pk and newfield:
case class new_schema(
pk: String,
newfield: String
)
I created a dummy record using that schema:
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.col
import spark.implicits._
val dummy_record = Seq(new new_schema("delete_later", null)).toDF
I inserted this new record (the existing 1M records will have newfield populated as null). I also removed this dummy record from the original table:
dummy_record
.write
.format("delta")
.option("mergeSchema", "true")
.mode("append")
.save(path_to_delta)
val original_dt: DeltaTable = DeltaTable.forPath(spark, path_to_delta)
original_dt.delete("pk = 'delete_later'")
Now the original table will have 6 fields: pk, field1, field2, field3, field4 and newfield
Finally I upsert the newfield values in the corresponding 1M records using pk as join key
val df_with_new_field = // You bring new data from somewhere...
original_dt
.as("original")
.merge(
df_with_new_field .as("new"),
"original.pk = new.pk")
.whenMatched
.update( Map(
"newfield" -> col("new.newfield")
))
.execute()
https://www.databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html
Have you tried using the merge statement?
https://docs.databricks.com/spark/latest/spark-sql/language-manual/merge-into.html

How to delete data from Hive external table for Non-Partition column?

I have created an external table in Hive partitioned by client and month.
The requirement is to delete the data for ID=201 from that table, but the table is not partitioned by the ID column.
I have tried to do it with INSERT OVERWRITE, but it's not working.
We are using Spark 2.2.0.
How can I solve this problem?
val sqlDF = spark.sql("select * from db.table")
val newSqlDF1 = sqlDF.filter(!col("ID").isin("201") && col("month").isin("062016"))
val columns = newSqlDF1.schema.fieldNames.mkString(",")
newSqlDF1.createOrReplaceTempView("myTempTable")
spark.sql(s"INSERT OVERWRITE TABLE db.table PARTITION(client, month) select ${columns} from myTempTable")

Spark- Load data frame contents in table in a loop

I use Scala/Spark to insert data into a Hive parquet table as follows:
for(*lots of current_Period_Id*){//This loop is on a result of another query that returns multiple rows of current_Period_Id
val myDf = hiveContext.sql(s"""SELECT columns FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id""")
val count: Int = myDf.count().toInt
if(count>0){
hiveContext.sql(s"""INSERT INTO destinationtable PARTITION(period_id=$current_Period_Id) SELECT columns FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id""")
}
}
This approach takes a lot of time to complete because the select statement is being executed twice.
I'm trying to avoid selecting data twice and one way I've thought of is writing the dataframe myDf to the table directly.
This is the gist of the code I'm trying to use for the purpose
val sparkConf = new SparkConf().setAppName("myApp")
.set("spark.yarn.executor.memoryOverhead","4096")
val sc = new SparkContext(sparkConf)
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.exec.dynamic.partition","true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
for(*lots of current_Period_Id*){//This loop is on a result of another query
val myDf = hiveContext.sql(s"SELECT COLUMNS FROM MULTIPLE TABLES WHERE period_id=$current_Period_Id")
val count: Int = myDf.count().toInt
if(count>0){
myDf.write.mode("append").format("parquet").partitionBy("PERIOD_ID").saveAsTable("destinationtable")
}
}
But I get an error in the myDf.write part.
java.util.NoSuchElementException: key not found: period_id
The destination table is partitioned by period_id.
Could someone help me with this?
The spark version I'm using is 1.5.0-cdh5.5.2.
The DataFrame schema and the table's description differ from each other. The column name is upper case (PERIOD_ID) in your DF but lower case (period_id) in the table. Try using the lowercase period_id, as you do in the SQL.

Creating hive table using parquet file metadata

I wrote a DataFrame as a parquet file, and I would like to read the file from Hive using the metadata stored in the parquet output.
Output from the parquet write:
_common_metadata part-r-00000-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet part-r-00002-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet _SUCCESS
_metadata part-r-00001-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet part-r-00003-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
Hive table
CREATE TABLE testhive
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'/home/gz_files/result';
FAILED: SemanticException [Error 10043]: Either list of columns or a custom serializer should be specified
How can I infer the metadata from the parquet file?
If I open _common_metadata, I see the content below:
PAR1LHroot
%TSN%
%TS%
%Etype%
)org.apache.spark.sql.parquet.row.metadata▒{"type":"struct","fields":[{"name":"TSN","type":"string","nullable":true,"metadata":{}},{"name":"TS","type":"string","nullable":true,"metadata":{}},{"name":"Etype","type":"string","nullable":true,"metadata":{}}]}
Or how can I parse the metadata file?
Here's a solution I've come up with to get the metadata from parquet files in order to create a Hive table.
First start a spark-shell (Or compile it all into a Jar and run it with spark-submit, but the shell is SOO much easier)
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.DataFrame
val df=sqlContext.parquetFile("/path/to/_common_metadata")
def creatingTableDDL(tableName:String, df:DataFrame): String={
val cols = df.dtypes
var ddl1 = "CREATE EXTERNAL TABLE "+tableName + " ("
//looks at the datatypes and columns names and puts them into a string
val colCreate = (for (c <-cols) yield(c._1+" "+c._2.replace("Type",""))).mkString(", ")
ddl1 += colCreate + ") STORED AS PARQUET LOCATION '/wherever/you/store/the/data/'"
ddl1
}
val test_tableDDL=creatingTableDDL("test_table",df)
It will provide you with the datatypes that Hive will use for each column as they are stored in Parquet.
E.G: CREATE EXTERNAL TABLE test_table (COL1 Decimal(38,10), COL2 String, COL3 Timestamp) STORED AS PARQUET LOCATION '/path/to/parquet/files'
I'd just like to expand on James Tobin's answer. There's a StructField class which provides Hive's data types without doing string replacements.
// Tested on Spark 1.6.0.
import org.apache.spark.sql.DataFrame
def dataFrameToDDL(dataFrame: DataFrame, tableName: String): String = {
val columns = dataFrame.schema.map { field =>
" " + field.name + " " + field.dataType.simpleString.toUpperCase
}
s"CREATE TABLE $tableName (\n${columns.mkString(",\n")}\n)"
}
This solves the IntegerType problem.
scala> val dataFrame = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("x", "y")
dataFrame: org.apache.spark.sql.DataFrame = [x: int, y: string]
scala> print(dataFrameToDDL(dataFrame, "t"))
CREATE TABLE t (
x INT,
y STRING
)
This should work with any DataFrame, not just with Parquet. (e.g., I'm using this with a JDBC DataFrame.)
As an added bonus, if your target DDL supports nullable columns, you can extend the function by checking StructField.nullable.
A small improvement over Victor's answer (adding backticks around field.name), modified to bind the table to a local parquet file (tested on Spark 1.6.1):
def dataFrameToDDL(dataFrame: DataFrame, tableName: String, absFilePath: String): String = {
val columns = dataFrame.schema.map { field =>
" `" + field.name + "` " + field.dataType.simpleString.toUpperCase
}
s"CREATE EXTERNAL TABLE $tableName (\n${columns.mkString(",\n")}\n) STORED AS PARQUET LOCATION '"+absFilePath+"'"
}
Also note that:
A HiveContext is needed, since SQLContext does not support creating external tables.
The path to the parquet folder must be an absolute path.
I would like to expand on James's answer.
The following code will work for all data types, including ARRAY, MAP and STRUCT.
I have tested it on Spark 2.2.
val df=sqlContext.parquetFile("parquetFilePath")
val schema = df.schema
var columns = schema.fields
var ddl1 = "CREATE EXTERNAL TABLE " + tableName + " ("
val cols=(for(column <- columns) yield column.name+" "+column.dataType.sql).mkString(",")
ddl1=ddl1+cols+" ) STORED AS PARQUET LOCATION '/tmp/hive_test1/'"
spark.sql(ddl1)
I had the same question. It might be hard to implement in practice though, as Parquet supports schema evolution:
http://www.cloudera.com/content/www/en-us/documentation/archive/impala/2-x/2-0-x/topics/impala_parquet.html#parquet_schema_evolution_unique_1
For example, you could add a new column to your table without having to touch the data that's already in the table. Only new data files will have the new metadata (compatible with the previous version).
Schema merging has been switched off by default since Spark 1.5.0 because it is a "relatively expensive operation":
http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging
So inferring the most recent schema may not be as simple as it sounds, although quick-and-dirty approaches are quite possible, e.g. by parsing the output from
$ parquet-tools schema /home/gz_files/result/000000_0
Actually, Impala supports
CREATE TABLE LIKE PARQUET
(no columns section altogether):
https://docs.cloudera.com/runtime/7.2.15/impala-sql-reference/topics/impala-create-table.html
The tags on your question are "hive" and "spark", and I don't see that this is implemented in Hive, but if you use CDH, it may be what you are looking for.
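For the "how to parse the metadata file" part of the question: the JSON string visible in the _common_metadata dump is Spark's own schema description, so it can be turned into DDL without starting Spark. The following is a minimal sketch in plain Python under stated assumptions: the function name, the type mapping table and the restriction to primitive types are illustrative choices, not part of any library, and extracting the JSON string from the binary footer is left out.

```python
import json

# Spark JSON type name -> Hive column type. Primitive types only (an assumption
# made to keep the sketch short); nested or decimal types raise an error.
SPARK_TO_HIVE = {
    "string": "STRING",
    "byte": "TINYINT",
    "short": "SMALLINT",
    "integer": "INT",
    "long": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "boolean": "BOOLEAN",
    "binary": "BINARY",
    "date": "DATE",
    "timestamp": "TIMESTAMP",
}


def schema_json_to_ddl(schema_json, table_name, location):
    """Build a CREATE EXTERNAL TABLE statement from a Spark schema JSON string."""
    columns = []
    for field in json.loads(schema_json)["fields"]:
        spark_type = field["type"]
        # Complex types are encoded as JSON objects rather than plain strings.
        if not isinstance(spark_type, str) or spark_type not in SPARK_TO_HIVE:
            raise ValueError(
                f"unsupported type for column {field['name']}: {spark_type!r}"
            )
        columns.append(f"`{field['name']}` {SPARK_TO_HIVE[spark_type]}")
    return (
        f"CREATE EXTERNAL TABLE {table_name} ({', '.join(columns)}) "
        f"STORED AS PARQUET LOCATION '{location}'"
    )


# Schema string copied from the _common_metadata dump in the question.
schema = (
    '{"type":"struct","fields":['
    '{"name":"TSN","type":"string","nullable":true,"metadata":{}},'
    '{"name":"TS","type":"string","nullable":true,"metadata":{}},'
    '{"name":"Etype","type":"string","nullable":true,"metadata":{}}]}'
)
ddl = schema_json_to_ddl(schema, "testhive", "/home/gz_files/result")
print(ddl)
```

Because of the schema evolution caveat mentioned above, a schema taken from a single metadata file may not describe every data file in the folder.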