Create Hive Table from Spark using API, rather than SQL? - scala

I want to create a hive table with partitions.
The schema for the table is:
val schema = StructType(Seq(StructField("name", StringType, true), StructField("age", IntegerType, true)))
I can do this with Spark-SQL using:
val query = "CREATE TABLE some_new_table (name string, age integer) USING org.apache.spark.sql.parquet OPTIONS (path '<some_path>') PARTITIONED BY (age)"
spark.sql(query)
When I try to do the same with the Spark API (using Scala), the table is filled with data. I only want to create an empty table and define its partitions. This is what I am doing; what am I doing wrong?
val df = spark.createDataFrame(sc.emptyRDD[Row], schema)
val options = Map("path" -> "<some_path>", "partitionBy" -> "age")
df.sqlContext.createExternalTable("some_new_table", "org.apache.spark.sql.parquet", schema, options)
I am using Spark-2.1.1.

If you skip partitioning, you can try saveAsTable:
spark.createDataFrame(sc.emptyRDD[Row], schema)
.write
.format("parquet")
//.partitionBy("age")
.saveAsTable("some_new_table")
Spark partitioning and Hive partitioning are not compatible, so if you want access from Hive you have to use SQL: https://issues.apache.org/jira/browse/SPARK-14927
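If the table has to be created through SQL anyway, one option is to generate the DDL from the StructType you already have instead of writing it by hand. The following is only a sketch under assumptions: the helper name createTableDdl is hypothetical, STORED AS PARQUET is an assumed storage format, and it relies on DataType.simpleString to render Hive-style type names for simple column types.

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object HiveDdlSketch {
  // Builds a Hive-style CREATE TABLE statement from a Spark schema.
  // Partition columns are moved out of the main column list, as Hive DDL requires.
  def createTableDdl(table: String, schema: StructType, partitionCols: Seq[String]): String = {
    def render(fields: Seq[StructField]): String =
      fields.map(f => s"${f.name} ${f.dataType.simpleString}").mkString(", ")
    val (partFields, dataFields) =
      schema.fields.toSeq.partition(f => partitionCols.contains(f.name))
    s"CREATE TABLE $table (${render(dataFields)}) " +
      s"PARTITIONED BY (${render(partFields)}) STORED AS PARQUET"
  }

  // Schema from the question.
  val schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true)
  ))

  val ddl: String = createTableDdl("some_new_table", schema, Seq("age"))

  // Assumption: `spark` is a SparkSession built with enableHiveSupport().
  // spark.sql(ddl)
}
```

Running the statement creates an empty table; no DataFrame is written, so no data files are produced.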

Related

Not able to insert Value using SparkSql

I need to insert some values into my Hive table using Spark SQL. I'm using the code below.
val filepath: String = "/user/usename/filename.csv"
val fileName : String = filepath
val result = fileName.split("/")
val fn=result(3) //filename
val e=LocalDateTime.now() //timestamp
First I tried INSERT INTO ... VALUES, but then I found that this feature is not available in Spark SQL.
val ds = sparksession.sql(s"insert into mytable (filepath, filename, Start_Time) values ('${filepath}', '${fn}', '${e}')")
Is there any other way to insert these values using Spark SQL? (mytable is empty, and I need to load this table every day.)
You can directly use Spark Dataframe Write API to insert data into the table.
If you do not have a Spark DataFrame yet, first create one using spark.createDataFrame(), then write the data as follows:
df.write.insertInto("name of hive table")
The code below worked for me. Since I need to use variables in my DataFrame, I first created a DataFrame from the selected values and then saved it to the Hive table with df.write.insertInto(tablename).
val filepath: String = "/user/usename/filename.csv"
val fileName : String = filepath
val result = fileName.split("/")
val fn=result(3) //filename
val e=LocalDateTime.now() //timestamp
val df1=sparksession.sql(s" select '${filepath}' as file_path,'${fn}' as filename,'${e}' as Start_Time")
df1.write.insertInto("dbname.tablename")
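The same row can also be built without going through a SQL string. The sketch below is an illustration under assumptions, not the poster's code: the table name audit_log and the helper auditRow are hypothetical, the table is created locally with USING parquet so that the example runs without a Hive metastore, and the timestamp is stored as a string. One point worth knowing is that insertInto matches columns by position, not by name, so the DataFrame's column order has to follow the table definition.

```scala
import java.time.LocalDateTime
import org.apache.spark.sql.{DataFrame, SparkSession}

object InsertIntoSketch {
  // Builds a one-row DataFrame whose column order matches the target table,
  // because insertInto resolves columns by position rather than by name.
  def auditRow(spark: SparkSession, filePath: String, startTime: LocalDateTime): DataFrame = {
    import spark.implicits._
    val fileName = filePath.split("/").last // last path segment, whatever the depth
    Seq((filePath, fileName, startTime.toString)).toDF("file_path", "filename", "start_time")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("insert-into-sketch")
      .getOrCreate()
    try {
      // Hypothetical stand-in for the Hive table, which already exists in the original setup.
      spark.sql(
        "CREATE TABLE IF NOT EXISTS audit_log " +
          "(file_path STRING, filename STRING, start_time STRING) USING parquet")
      auditRow(spark, "/user/usename/filename.csv", LocalDateTime.now())
        .write.insertInto("audit_log")
      spark.table("audit_log").show(truncate = false)
    } finally spark.stop()
  }
}
```

Taking the last path segment instead of result(3) is a design choice here: it keeps working when the file sits at a different directory depth.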

Spark dataframe cast column for Kudu compatibility

(I am new to Spark, Impala and Kudu.) I am trying to copy a table from an Oracle DB to an Impala table having the same structure, in Spark, through Kudu. I am getting an error when the code tries to map an Oracle NUMBER to a Kudu data type. How can I change the data type of a Spark DataFrame to make it compatible with Kudu?
This is intended to be a 1-to-1 copy of data from Oracle to Impala. I have extracted the Oracle schema of the source table and created a target Impala table with the same structure (same column names and a reasonable mapping of data types). I was hoping that Spark+Kudu would map everything automatically and just copy the data. Instead, Kudu complains that it cannot map DecimalType(38,0).
I would like to specify that "column #1, with name SOME_COL, which is a NUMBER in Oracle, should be mapped to a LongType, which is supported in Kudu".
How can I do that?
// This works
val df: DataFrame = spark.read
.option("fetchsize", 10000)
.option("driver", "oracle.jdbc.driver.OracleDriver")
.jdbc("jdbc:oracle:thin:@(DESCRIPTION=...)", "SCHEMA.TABLE_NAME", partitions, props)
// This does not work
kuduContext.insertRows(df.toDF(colNamesLower: _*), "impala::schema.table_name")
// Error: No support for Spark SQL type DecimalType(38,0)
// See https://github.com/cloudera/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/SparkUtil.scala
// So let's see the Spark data types
df.dtypes.foreach{case (colName, colType) => println(s"$colName: $colType")}
// Spark data type: SOME_COL DecimalType(38,0)
// Oracle data type: SOME_COL NUMBER -- no precision specifier; values are int/long
// Kudu data type: SOME_COL BIGINT
Apparently, we can specify a custom schema when reading from a JDBC data source.
connectionProperties.put("customSchema", "id DECIMAL(38, 0), name STRING")
val jdbcDF3 = spark.read
.jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
That worked. I was able to specify a customSchema like so:
col1 Long, col2 Timestamp, col3 Double, col4 String
and with that, the code works:
import spark.implicits._
val df: Dataset[case_class_for_table] = spark.read
.option("fetchsize", 10000)
.option("driver", "oracle.jdbc.driver.OracleDriver")
.jdbc("jdbc:oracle:thin:@(DESCRIPTION=...)", "SCHEMA.TABLE_NAME", partitions, props)
.as[case_class_for_table]
kuduContext.insertRows(df.toDF(colNamesLower: _*), "impala::schema.table_name")
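An alternative to customSchema, which was not tried in this thread, is to cast the offending columns after the read. The sketch below assumes that every DecimalType column with scale 0 really holds integer values that fit into a Long, as the question states for SOME_COL; the helper name decimalsToLong is hypothetical. Values outside the Long range would not survive the cast intact, so this is only appropriate when the data is known to be in range.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, LongType}

object DecimalToLongSketch {
  // Casts every DecimalType column with scale 0 to LongType.
  // withColumn with an existing name replaces the column in place, so the
  // column order is preserved; all other columns are left untouched.
  def decimalsToLong(df: DataFrame): DataFrame =
    df.schema.fields.foldLeft(df) { (acc, field) =>
      field.dataType match {
        case d: DecimalType if d.scale == 0 =>
          acc.withColumn(field.name, col(field.name).cast(LongType))
        case _ => acc
      }
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("decimal-to-long-sketch")
      .getOrCreate()
    try {
      // Stand-in for the Oracle read: one DECIMAL(38,0) column and one string column.
      val df = spark.sql("SELECT CAST(42 AS DECIMAL(38,0)) AS SOME_COL, 'x' AS OTHER_COL")
      decimalsToLong(df).printSchema()
    } finally spark.stop()
  }
}
```

In the original code this would be applied to df before the call to kuduContext.insertRows.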

How to compare the schema of a dataframe read from an RDBMS table against the same table on Hive?

I created a dataframe by reading an RDBMS table from postgres as below:
val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
.option("dbtable", s"(${execQuery}) as year2017")
.option("user", devUserName)
.option("password", devPassword)
.option("numPartitions",10)
.load()
execQuery content: select qtd_balance_text,ytd_balance_text,del_flag,source_system_name,period_year from dbname.hrtable;
This is the schema of my final dataframe:
println(yearDF.schema)
StructType(StructField(qtd_balance_text,StringType,true), StructField(ytd_balance_text,StringType,true), StructField(del_flag,IntegerType,true), StructField(source_system_name,StringType,true), StructField(period_year,DecimalType(15,0),true))
There is a table on Hive with the same name, hrtable, and the same column names. Before ingesting the data into the Hive table, I want to add a check in the code to see whether the schemas of the GP and Hive tables are the same.
I was able to access the schema as following:
spark.sql("desc formatted databasename.hrtable").collect.foreach(println)
But the problem is that it returns the schema in a different format:
[ qtd_balance_text,bigint,null]
[ ytd_balance_text,string,null]
[ del_flag,string,null]
[source_system_name,bigint,null]
[ period_year,bigint,null]
[Type,MANAGED,]
[Provider,hive,]
[Table Properties,[orc.stripe.size=536870912, transient_lastDdlTime=1523914516, last_modified_time=1523914516, last_modified_by=username, orc.compress.size=268435456, orc.compress=ZLIB, serialization.null.format=null],]
[Location,hdfs://devenv/apps/hive/warehouse/databasename.db/hrtable,]
[Serde Library,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
[InputFormat,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
[OutputFormat,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
[Storage Properties,[serialization.format=1],]
[Partition Provider,Catalog,]
Clearly I cannot compare the schemas when they are presented this way, and I couldn't work out how to do it.
Could anyone let me know how to properly compare the schema of the dataframe yearDF and the Hive table hrtable?
Instead of parsing the Hive table's schema output, you can try this option.
Read the Hive table as a dataframe as well. Assume this dataframe is df1 and your yearDF is df2. Then compare the schemas as below.
If the column count can also differ between the two dataframes, add a df1.schema.size == df2.schema.size comparison to the if condition as well, because zip silently drops the extra columns of the longer schema.
val x = df1.schema.sortBy(x => x.name) // schema of dataframe 1, sorted by column name
val y = df2.schema.sortBy(x => x.name) // schema of dataframe 2, sorted by column name
val out = x.zip(y).filter(x => x._1 != x._2) // pair up the sorted fields and keep the pairs that differ (StructField equality also covers nullability and metadata)
if (out.size == 0) { // `out` is empty when the schemas match
println("matching")
}
else println("not matching")
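A variant that reports what differs, rather than only whether something differs, can be sketched as follows. It is an illustration under assumptions: the helper name schemaDiff is hypothetical, only column names and data types are compared (nullability and metadata are deliberately ignored), and names are lower-cased on the assumption that the Hive side reports lower-case column names.

```scala
import org.apache.spark.sql.types._

object SchemaCompareSketch {
  // Returns one message per difference; an empty result means both schemas
  // have the same column names (case-insensitively) and the same data types.
  def schemaDiff(left: StructType, right: StructType): Seq[String] = {
    val l = left.fields.map(f => f.name.toLowerCase -> f.dataType).toMap
    val r = right.fields.map(f => f.name.toLowerCase -> f.dataType).toMap
    val onlyLeft = (l.keySet -- r.keySet).toSeq.sorted.map(n => s"$n: only in left")
    val onlyRight = (r.keySet -- l.keySet).toSeq.sorted.map(n => s"$n: only in right")
    val typeMismatch = (l.keySet intersect r.keySet).toSeq.sorted.collect {
      case n if l(n) != r(n) => s"$n: ${l(n).simpleString} vs ${r(n).simpleString}"
    }
    onlyLeft ++ onlyRight ++ typeMismatch
  }

  def main(args: Array[String]): Unit = {
    // Two of the columns from the question, typed as in the JDBC and Hive schemas.
    val jdbcSchema = StructType(Seq(
      StructField("period_year", DecimalType(15, 0)),
      StructField("ytd_balance_text", StringType)))
    val hiveSchema = StructType(Seq(
      StructField("period_year", LongType),
      StructField("ytd_balance_text", StringType)))
    schemaDiff(jdbcSchema, hiveSchema).foreach(println)
  }
}
```

With real dataframes the call would be schemaDiff(yearDF.schema, spark.table("databasename.hrtable").schema). Because missing columns are reported explicitly, no separate size check is needed.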

How to add a new column to an existing dataframe while also specifying the datatype of it?

I have a dataframe: yearDF obtained from reading an RDBMS table on Postgres which I need to ingest in a Hive table on HDFS.
val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
.option("dbtable", s"(${execQuery}) as year2017")
.option("user", devUserName)
.option("password", devPassword)
.option("numPartitions",10)
.load()
Before ingesting it, I have to add a new column, delete_flag, of datatype IntegerType. This column marks whether the row with a given primary key has been deleted in the source table.
To add a new column to an existing dataframe, I know there is dataFrame.withColumn("del_flag", someoperation), but I see no option there to specify the datatype of the new column.
I have written the StructType for the new column as:
val delFlagColumn = StructType(List(StructField("delete_flag", IntegerType, true)))
But I don't understand how to add this column to the existing dataframe yearDF. Could anyone let me know how to add a new column, along with its datatype, to an existing dataframe?
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.IntegerType
df.withColumn("a", lit("1").cast(IntegerType)).show()
The cast is not required if you pass lit(1), since Spark infers IntegerType from the Int literal. If you pass lit("1"), the column is a string and the cast converts it to an integer.
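Applied to the column from the question, this could look like the sketch below. The default value 0 and the helper names are assumptions made for illustration. The second helper shows the case where the cast is genuinely needed: lit(null) has no useful type of its own, so the cast is what gives the column its IntegerType.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.IntegerType

object AddTypedColumnSketch {
  // Adds delete_flag as an integer column with an assumed default of 0.
  def withDeleteFlag(df: DataFrame): DataFrame =
    df.withColumn("delete_flag", lit(0).cast(IntegerType))

  // Adds delete_flag as an integer column that is null in every row.
  def withNullDeleteFlag(df: DataFrame): DataFrame =
    df.withColumn("delete_flag", lit(null).cast(IntegerType))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("add-typed-column-sketch")
      .getOrCreate()
    try {
      val df = spark.range(3).toDF("id") // stand-in for yearDF
      withDeleteFlag(df).printSchema()
      withNullDeleteFlag(df).printSchema()
    } finally spark.stop()
  }
}
```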

How to create a RDD from RC file using data which is partitioned in the hive table

CREATE TABLE employee_details(
emp_first_name varchar(50),
emp_last_name varchar(50),
emp_dept varchar(50)
)
PARTITIONED BY (
emp_doj varchar(50),
emp_dept_id int )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat';
The Hive table is stored at /data/warehouse/employee_details.
I have a Hive table, employee_details, loaded with data. It is partitioned by emp_doj and emp_dept_id, and its file format is RCFile.
I would like to process the data in the table using Spark SQL without a HiveContext (simply using sqlContext).
Could you please help me load the partitioned data of the Hive table into an RDD and convert it to a DataFrame?
If you are using Spark 2.0, you can do it this way.
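import org.apache.spark.sql.SparkSession
// Assumption: the warehouse root is the parent directory of the table location given in the question
val warehouseLocation = "/data/warehouse"
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
import spark.implicits._
import spark.sql
// Queries are expressed in HiveQL; replace src with your own table name
sql("SELECT * FROM src").show()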
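To get from the table to a DataFrame restricted to one partition, and from there to an RDD, something along these lines may help. It is a sketch under stated assumptions: the helper readDept and the table employee_details_demo are hypothetical, and the demo table is written locally as Parquet so that the example runs without a Hive metastore. Against the real RCFile table you would build the session with enableHiveSupport() and pass employee_details instead.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.col

object PartitionedReadSketch {
  // Reads a catalog table and filters on a partition column, which lets Spark
  // skip the directories of the other partitions.
  def readDept(spark: SparkSession, table: String, deptId: Int): DataFrame =
    spark.table(table).where(col("emp_dept_id") === deptId)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("partitioned-read-sketch")
      // .enableHiveSupport() // needed to see tables of an existing Hive metastore
      .getOrCreate()
    import spark.implicits._
    try {
      // Local stand-in for the Hive table, with the same columns and partition keys.
      Seq(
        ("Ann", "Lee", "eng", "2015-01-01", 10),
        ("Bob", "Ray", "ops", "2016-02-01", 20)
      ).toDF("emp_first_name", "emp_last_name", "emp_dept", "emp_doj", "emp_dept_id")
        .write.mode("overwrite")
        .partitionBy("emp_doj", "emp_dept_id")
        .saveAsTable("employee_details_demo")

      val df = readDept(spark, "employee_details_demo", 10)
      val rdd: RDD[Row] = df.rdd // the DataFrame's rows as an RDD
      rdd.collect().foreach(println)
    } finally spark.stop()
  }
}
```

Going through the DataFrame first and calling .rdd afterwards is usually simpler than reading the RCFile files directly, because the partition values come from the catalog and not from the files themselves.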