How to compare the schema of a dataframe read from an RDBMS table against the same table on Hive?


I created a dataframe by reading an RDBMS table from postgres as below:
val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
.option("dbtable", s"(${execQuery}) as year2017")
.option("user", devUserName)
.option("password", devPassword)
.option("numPartitions",10)
.load()
execQuery content: select qtd_balance_text,ytd_balance_text,del_flag,source_system_name,period_year from dbname.hrtable;
This is the schema of my final dataframe:
println(yearDF.schema)
StructType(StructField(qtd_balance_text,StringType,true), StructField(ytd_balance_text,StringType,true), StructField(del_flag,IntegerType,true), StructField(source_system_name,StringType,true), StructField(period_year,DecimalType(15,0),true))
There is a table on Hive with the same name, hrtable, and the same column names. Before ingesting the data into the Hive table, I want to add a check in the code to see whether the schemas of the GP and Hive tables are the same.
I was able to access the schema as following:
spark.sql("desc formatted databasename.hrtable").collect.foreach(println)
But the problem is that it returns the schema in a different form, mixed with other table metadata:
[ qtd_balance_text,bigint,null]
[ ytd_balance_text,string,null]
[ del_flag,string,null]
[source_system_name,bigint,null]
[ period_year,bigint,null]
[Type,MANAGED,]
[Provider,hive,]
[Table Properties,[orc.stripe.size=536870912, transient_lastDdlTime=1523914516, last_modified_time=1523914516, last_modified_by=username, orc.compress.size=268435456, orc.compress=ZLIB, serialization.null.format=null],]
[Location,hdfs://devenv/apps/hive/warehouse/databasename.db/hrtable,]
[Serde Library,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
[InputFormat,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
[OutputFormat,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
[Storage Properties,[serialization.format=1],]
[Partition Provider,Catalog,]
Clearly I cannot compare the schemas when they are presented in this way, and I couldn't work out how to do it.
Could anyone let me know how to properly compare the schema of dataframe yearDF and the hive table: hrtable?

Instead of parsing the Hive table's schema output, you can try this option.
Read the Hive table as a dataframe too. Assume this dataframe is df1 and your yearDF is df2, then compare the schemas as below.
If the column count may also differ between the two dataframes, add a df1.columns.length == df2.columns.length comparison to the if condition as well, because zip silently drops the unmatched trailing fields.
val x = df1.schema.sortBy(_.name) // schema of dataframe 1, sorted by column name
val y = df2.schema.sortBy(_.name) // schema of dataframe 2, sorted by column name
// Pair the fields position by position (1st of df1 with 1st of df2, and so on)
// and keep only the pairs whose name, data type or nullability differ.
val out = x.zip(y).filter { case (a, b) => a != b }
if (out.isEmpty) { // `out` is empty when every paired field matches
println("matching")
}
else println("not matching")
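
A slightly more defensive variant of the comparison above, as a minimal sketch. It assumes that only column names and data types matter (nullability and field metadata are ignored, since a JDBC-sourced schema may differ from a Hive one in those) and that names should be matched case-insensitively. The helper name schemaDiff is hypothetical, not part of Spark, and the two hand-built schemas stand in for the real ones.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: returns one human-readable message per difference.
// An empty result means both schemas agree on column names and data types.
def schemaDiff(left: StructType, right: StructType): Seq[String] = {
  // Key each field by its lower-cased name so column order does not matter.
  val l = left.fields.map(f => f.name.toLowerCase -> f.dataType).toMap
  val r = right.fields.map(f => f.name.toLowerCase -> f.dataType).toMap

  val onlyLeft  = (l.keySet -- r.keySet).toSeq.sorted.map(n => s"$n: only in left schema")
  val onlyRight = (r.keySet -- l.keySet).toSeq.sorted.map(n => s"$n: only in right schema")
  val typeMismatch = (l.keySet intersect r.keySet).toSeq.sorted.collect {
    case n if l(n) != r(n) => s"$n: ${l(n).simpleString} vs ${r(n).simpleString}"
  }
  onlyLeft ++ onlyRight ++ typeMismatch
}

// Stand-ins for the Hive table schema and for yearDF.schema
val hiveSchema = StructType(Seq(
  StructField("del_flag", StringType),
  StructField("period_year", LongType)))
val jdbcSchema = StructType(Seq(
  StructField("period_year", DecimalType(15, 0)),
  StructField("del_flag", IntegerType),
  StructField("source_system_name", StringType)))

val diffs = schemaDiff(hiveSchema, jdbcSchema)
diffs.foreach(println)
```

With real data, the two arguments would be spark.table("databasename.hrtable").schema and yearDF.schema. Unlike the zip-based check, this reports columns that exist on only one side instead of dropping them.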

Related

Spark dataframe cast column for Kudu compatibility

(I am new to Spark, Impala and Kudu.) I am trying to copy a table from an Oracle DB to an Impala table having the same structure, in Spark, through Kudu. I am getting an error when the code tries to map an Oracle NUMBER to a Kudu data type. How can I change the data type of a Spark DataFrame to make it compatible with Kudu?
This is intended to be a 1-to-1 copy of data from Oracle to Impala. I have extracted the Oracle schema of the source table and created a target Impala table with the same structure (same column names and a reasonable mapping of data types). I was hoping that Spark+Kudu would map everything automatically and just copy the data. Instead, Kudu complains that it cannot map DecimalType(38,0).
I would like to specify that "column #1, with name SOME_COL, which is a NUMBER in Oracle, should be mapped to a LongType, which is supported in Kudu".
How can I do that?
// This works
val df: DataFrame = spark.read
.option("fetchsize", 10000)
.option("driver", "oracle.jdbc.driver.OracleDriver")
.jdbc("jdbc:oracle:thin:@(DESCRIPTION=...)", "SCHEMA.TABLE_NAME", partitions, props)
// This does not work
kuduContext.insertRows(df.toDF(colNamesLower: _*), "impala::schema.table_name")
// Error: No support for Spark SQL type DecimalType(38,0)
// See https://github.com/cloudera/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/SparkUtil.scala
// So let's see the Spark data types
df.dtypes.foreach{case (colName, colType) => println(s"$colName: $colType")}
// Spark data type: SOME_COL DecimalType(38,0)
// Oracle data type: SOME_COL NUMBER -- no precision specifier; values are int/long
// Kudu data type: SOME_COL BIGINT

Apparently, we can specify a custom schema when reading from a JDBC data source.
connectionProperties.put("customSchema", "id DECIMAL(38, 0), name STRING")
val jdbcDF3 = spark.read
.jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
That worked. I was able to specify a customSchema like so:
col1 Long, col2 Timestamp, col3 Double, col4 String
and with that, the code works:
import spark.implicits._
val df: Dataset[case_class_for_table] = spark.read
.option("fetchsize", 10000)
.option("driver", "oracle.jdbc.driver.OracleDriver")
.jdbc("jdbc:oracle:thin:@(DESCRIPTION=...)", "SCHEMA.TABLE_NAME", partitions, props)
.as[case_class_for_table]
kuduContext.insertRows(df.toDF(colNamesLower: _*), "impala::schema.table_name")
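
If many columns are affected, the customSchema string could be derived from the inferred schema rather than typed by hand. This is a minimal sketch, assuming every zero-scale decimal column really holds values that fit in a 64-bit integer (larger values would not fit, so this needs checking against the data) and that column names need no quoting. The helper name customSchemaFor is hypothetical, and the hand-built schema stands in for df.schema.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: builds a customSchema string that overrides only the
// zero-scale decimal columns, mapping them to BIGINT (LongType in Spark).
// Columns not listed in customSchema keep their default JDBC type mapping.
def customSchemaFor(inferred: StructType): String =
  inferred.fields.collect {
    case StructField(name, d: DecimalType, _, _) if d.scale == 0 => s"$name BIGINT"
  }.mkString(", ")

// Stand-in for the schema Spark infers from the Oracle table
val inferred = StructType(Seq(
  StructField("SOME_COL", DecimalType(38, 0)),
  StructField("AMOUNT", DecimalType(20, 4)),
  StructField("NAME", StringType)))

val customSchema = customSchemaFor(inferred)
println(customSchema)
```

The resulting string can then be passed the same way as above, for example with connectionProperties.put("customSchema", customSchema), before reading the table a second time.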

How to add a new column to an existing dataframe while also specifying the datatype of it?

I have a dataframe: yearDF obtained from reading an RDBMS table on Postgres which I need to ingest in a Hive table on HDFS.
val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
.option("dbtable", s"(${execQuery}) as year2017")
.option("user", devUserName)
.option("password", devPassword)
.option("numPartitions",10)
.load()
Before ingesting it, I have to add a new column, delete_flag, of datatype IntegerType. This column marks, for each primary key, whether the row has been deleted in the source table.
To add a new column to an existing dataframe, I know that there is the option: dataFrame.withColumn("del_flag",someoperation) but there is no such option to specify the datatype of new column.
I have written the StructType for the new column as:
val delFlagColumn = StructType(List(StructField("delete_flag", IntegerType, true)))
But I don't understand how to add this column with the existing dataFrame: yearDF. Could anyone let me know how to add a new column along with its datatype, to an existing dataFrame ?

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.IntegerType
df.withColumn("a", lit("1").cast(IntegerType)).show()
Casting is not required if you pass lit(1), since Spark infers the integer type from the literal. If you pass lit("1") instead, the column starts out as a string and the cast converts it to IntegerType.
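
Applied to the question's case, a minimal sketch that adds delete_flag as a nullable IntegerType column. It assumes a local SparkSession and a tiny stand-in dataframe in place of yearDF. The null default is also an assumption, since the question does not say what the initial value should be.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.IntegerType

// Local session for illustration only
val spark = SparkSession.builder().master("local[1]").appName("add-column-sketch").getOrCreate()
import spark.implicits._

// Stand-in for yearDF
val yearDF = Seq(("SYS_A", 2017), ("SYS_B", 2017)).toDF("source_system_name", "period_year")

// lit(null) on its own has NullType, so here the cast is what fixes the column type.
val withFlag = yearDF.withColumn("delete_flag", lit(null).cast(IntegerType))

withFlag.printSchema()
```

The separately built StructType from the question is not needed for this: the data type comes from the column expression passed to withColumn.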

How to fix org.apache.spark.sql.AnalysisException while changing the order of columns in a dataframe?

I am trying to load data from an RDBMS table on Postgres to Hive table on HDFS.
val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
.option("dbtable", s"(${query}) as year2017")
.option("user", devUserName).option("password", devPassword)
.option("numPartitions",15).load()
The Hive table is dynamically partitioned based on two columns: source_system_name,period_year
I have these column names present in a metadata table: metatables
val spColsDF = spark.read.format("jdbc").option("url",hiveMetaConURL)
.option("dbtable", "(select partition_columns from metainfo.metatables where tablename='finance.xx_gl_forecast') as colsPrecision")
.option("user", metaUserName)
.option("password", metaPassword)
.load()
I am trying to move the partition columns: source_system_name, period_year to the end of the dataFrame: yearDF because the columns that are used in Hive dynamic partitioning should be at the end.
To do that, I came up with the following logic:
val partition_columns = spColsDF.select("partition_columns").collect().map(_.getString(0)).toSeq
val allColsOrdered = yearDF.columns.diff(partition_columns) ++ partition_columns
val allCols = allColsOrdered.map(coln => org.apache.spark.sql.functions.col(coln))
val resultDF = yearDF.select(allCols:_*)
When I execute the code, I get an org.apache.spark.sql.AnalysisException as below:
Exception in thread "main" 18/08/28 18:09:30 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
org.apache.spark.sql.AnalysisException: cannot resolve '`source_system_name,period_year`' given input columns: [cost_center, period_num, period_name, currencies, cc_channel, scenario, xx_pk_id, period_year, cc_region, reference_code, source_system_name, source_record_type, xx_last_update_tms, xx_last_update_log_id, book_type, cc_function, product_line, ptd_balance_text, project, ledger_id, currency_code, xx_data_hash_id, qtd_balance_text, pl_market, version, qtd_balance, period, ptd_balance, ytd_balance_text, xx_hvr_last_upd_tms, geography, year, del_flag, trading_partner, ytd_balance, xx_data_hash_code, xx_creation_tms, forecast_id, drm_org, account, business_unit, gl_source_name, gl_source_system_name];;
'Project [forecast_id#26L, period_year#27, period_num#28, period_name#29, drm_org#30, ledger_id#31L, currency_code#32, source_system_name#33, source_record_type#34, gl_source_name#35, gl_source_system_name#36, year#37, period#38, scenario#39, version#40, currencies#41, business_unit#42, account#43, trading_partner#44, cost_center#45, geography#46, project#47, reference_code#48, product_line#49, ... 20 more fields]
+- Relation[forecast_id#26L,period_year#27,period_num#28,period_name#29,drm_org#30,ledger_id#31L,currency_code#32,source_system_name#33,source_record_type#34,gl_source_name#35,gl_source_system_name#36,year#37,period#38,scenario#39,version#40,currencies#41,business_unit#42,account#43,trading_partner#44,cost_center#45,geography#46,project#47,reference_code#48,product_line#49,... 19 more fields] JDBCRelation((select forecast_id,period_year,period_num,period_name,drm_org,ledger_id,currency_code,source_system_name,source_record_type,gl_source_name,gl_source_system_name,year,period,scenario,version,currencies,business_unit,account,trading_partner,cost_center,geography,project,reference_code,product_line,book_type,cc_region,cc_channel,cc_function,pl_market,ptd_balance,qtd_balance,ytd_balance,xx_hvr_last_upd_tms,xx_creation_tms,xx_last_update_tms,xx_last_update_log_id,xx_data_hash_code,xx_data_hash_id,xx_pk_id,null::integer as del_flag,ptd_balance::character varying as ptd_balance_text,qtd_balance::character varying as qtd_balance_text,ytd_balance::character varying as ytd_balance_text from analytics.xx_gl_forecast where period_year='2017') as year2017) [numPartitions=1]
But if I pass the same column names in another way as following, the code works fine:
val lastCols = Seq("source_system_name","period_year")
val allColsOrdered = yearDF.columns.diff(lastCols) ++ lastCols
val allCols = allColsOrdered.map(coln => org.apache.spark.sql.functions.col(coln))
val resultDF = yearDF.select(allCols:_*)
Could anyone tell me what mistake I am making here?

If you look at the error:
cannot resolve '`source_system_name,period_year`
It means that, the following line:
spColsDF.select("partition_columns").collect().map(_.getString(0)).toSeq
is returning something like:
Array("source_system_name,period_year")
That means both column names are concatenated into a single string that forms the first element of the array, instead of being separate elements as you want.
To get the desired result, you need to split that string on the comma. For example, the following should work:
spColsDF.select("partition_columns").collect.flatMap(_.getAs[String](0).split(","))
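
The effect of the split can be seen without Spark, as a minimal sketch in plain Scala. The two arrays are stand-ins for the collected metadata value and for yearDF.columns. The trim is an addition that assumes the stored list might contain spaces after the commas.

```scala
// Stand-in for the single value collected from the metadata table
val collected = Array("source_system_name,period_year")

// Split each value on commas and trim, giving one element per column name
val partitionColumns = collected.flatMap(_.split(",")).map(_.trim).toSeq

// Stand-in for yearDF.columns
val columns = Array("forecast_id", "period_year", "period_num", "source_system_name", "del_flag")

// Non-partition columns first, partition columns last
val allColsOrdered = columns.diff(partitionColumns) ++ partitionColumns
println(allColsOrdered.mkString(", "))
```

Without the split, diff removes nothing (no column is literally named "source_system_name,period_year") and the combined string is appended as if it were a column name, which is what the AnalysisException reports.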

Spark DF to Hive ORC table - Map type column

I am trying to write a Map type column from a Spark DF to a Hive ORC table, but it is failing with the error "Not matching column type".
Hive Table:
CREATE EXTERNAL TABLE `default.test_map_col`(
test_col Map<String,String>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION '/hdfs/path'
TBLPROPERTIES (
'serialization.null.format'='')
Schema used to create DF
val schema = StructType(Seq(StructField("map_key_values", MapType(StringType, StringType), nullable = true)))
Map column populated in the inputDF as
("map_key_values", map(lit("testkey"), lit("testval"))
I also tried using a UDF to populate the map column in DF as
val toStruct = udf((c1: Map[String, String]) => c1.map {
case (k, v) => k + "\u0003" + v}.toSeq)
Any thoughts on how to write this?

Create Hive Table from Spark using API, rather than SQL?

I want to create a hive table with partitions.
The schema for the table is:
val schema = StructType(Seq(StructField("name", StringType, true), StructField("age", IntegerType, true)))
I can do this with Spark-SQL using:
val query = "CREATE TABLE some_new_table (name string, age integer) USING org.apache.spark.sql.parquet OPTIONS (path '<some_path>') PARTITIONED BY (age)"
spark.sql(query)
When I try to do the same with the Spark API (using Scala), the table is filled with data. I only want to create an empty table and define its partitions. This is what I am doing; what am I doing wrong?
val df = spark.createDataFrame(sc.emptyRDD[Row], schema)
val options = Map("path" -> "<some_path>", "partitionBy" -> "age")
df.sqlContext.createExternalTable("some_new_table", "org.apache.spark.sql.parquet", schema, options)
I am using Spark-2.1.1.

If you can skip partitioning, you can try saveAsTable:
spark.createDataFrame(sc.emptyRDD[Row], schema)
.write
.format("parquet")
//.partitionBy("age")
.saveAsTable("some_new_table")
Spark partitioning and Hive partitioning are not compatible, so if you want access from Hive you have to use SQL: https://issues.apache.org/jira/browse/SPARK-14927
