Spark apply custom schema to a DataFrame - scala

I have data in a Parquet file and want to apply a custom schema to it.
My initial data within the Parquet file is as below:
root
|-- CUST_ID: decimal(9,0) (nullable = true)
|-- INACTV_DT: string (nullable = true)
|-- UPDT_DT: string (nullable = true)
|-- ACTV_DT: string (nullable = true)
|-- PMT_AMT: decimal(9,4) (nullable = true)
|-- CMT_ID: decimal(38,14) (nullable = true)
My custom schema is below,
root
|-- CUST_ID: decimal(38,0) (nullable = false)
|-- INACTV_DT: timestamp (nullable = false)
|-- UPDT_DT: timestamp (nullable = false)
|-- ACTV_DT: timestamp (nullable = true)
|-- PMT_AMT: decimal(19,4) (nullable = true)
|-- CMT_ID: decimal(38,14) (nullable = false)
Below is my code to apply the new schema to it:
val customSchema = getOracleDBSchema(sparkSession, QUERY).schema
val DF_frmOldParkquet = sqlContext_par.read.parquet("src/main/resources/data_0_0_0.parquet")
val rows: RDD[Row] = DF_frmOldParkquet.rdd
val newDataFrame = sparkSession.sqlContext.createDataFrame(rows, customSchema)
newDataFrame.printSchema()
newDataFrame.show()
I am getting the below error when I perform this operation:
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of timestamp
staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), fromDecimal, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, CUST_ID), DecimalType(38,0)), true) AS CUST_ID#27

There are two main applications of a schema in Spark SQL:
The schema argument passed to the schema method of the DataFrameReader, which is used when reading data in some formats (primarily plain-text files). In this case the schema can be used to automatically cast the input records (a sketch follows below).
The schema argument passed to createDataFrame (the variants that take an RDD or a List of Rows) of the SparkSession. In this case the schema has to conform to the data and is not used for casting.
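For illustration, the first case might look like the following minimal sketch (a CSV input at a hypothetical path, with column names borrowed from your data):
import org.apache.spark.sql.types._

// Hypothetical example: a schema handed to the reader is applied while parsing
// plain-text input (CSV here), so values are cast as they are read.
val csvSchema = StructType(Seq(
  StructField("CUST_ID", DecimalType(38, 0), nullable = false),
  StructField("INACTV_DT", TimestampType, nullable = false)
))

val csvDF = sparkSession.read
  .schema(csvSchema)
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") // adjust to the real format
  .csv("src/main/resources/some_input.csv")         // hypothetical path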
Neither of the above is applicable in your case:
The input is strongly typed, so the schema, if present, is ignored by the reader.
The schema doesn't match the data, so it cannot be used with createDataFrame.
In this scenario you should cast each column to the desired type. Assuming the types are compatible, something like this should work:
val newDataFrame = customSchema.fields.foldLeft(DF_frmOldParkquet) {
  (df, field) => df.withColumn(field.name, df(field.name).cast(field.dataType))
}
Depending on the format of the data, this may or may not be sufficient. For example, if the fields that should become timestamps don't use standard formatting, the cast won't work and you'll have to use Spark's datetime processing functions.
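For instance, a minimal sketch using to_timestamp (Spark 2.2+), assuming the date strings follow a pattern like dd-MMM-yyyy; replace the pattern with whatever is actually present in your data:
import org.apache.spark.sql.functions.{col, to_timestamp}

// "dd-MMM-yyyy" is an assumed pattern; adjust it to match the source strings.
val withTimestamps = DF_frmOldParkquet
  .withColumn("INACTV_DT", to_timestamp(col("INACTV_DT"), "dd-MMM-yyyy"))
  .withColumn("UPDT_DT", to_timestamp(col("UPDT_DT"), "dd-MMM-yyyy"))
  .withColumn("ACTV_DT", to_timestamp(col("ACTV_DT"), "dd-MMM-yyyy"))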

Related

Error while inserting into partitioned hive table for spark scala

I have a Hive table with the following structure:
CREATE TABLE gcganamrswp_work.historical_trend_result(
column_name string,
metric_name string,
current_percentage string,
lower_threshold double,
upper_threshold double,
calc_status string,
final_status string,
support_override string,
dataset_name string,
insert_timestamp string,
appid string,
currentdate string,
indicator map<string,string>)
PARTITIONED BY (
appname string,
year_month int)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY");
I have a Spark DataFrame with the following schema:
root
|-- metric_name: string (nullable = true)
|-- column_name: string (nullable = true)
|-- Lower_Threshold: double (nullable = true)
|-- Upper_Threshold: double (nullable = true)
|-- Current_Percentage: double (nullable = true)
|-- Calc_Status: string (nullable = false)
|-- Final_Status: string (nullable = false)
|-- support_override: string (nullable = false)
|-- Dataset_Name: string (nullable = false)
|-- insert_timestamp: string (nullable = false)
|-- appId: string (nullable = false)
|-- currentDate: string (nullable = false)
|-- indicator: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = false)
|-- appname: string (nullable = false)
|-- year_month: string (nullable = false)
When I try to insert into the Hive table using the code below, it fails:
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
data_df.repartition(1)
.write.mode("append")
.format("hive")
.insertInto(Outputhive_table)
Spark Version : Spark 2.4.0
Error:
ERROR Hive:1987 - Exception when loading partition with parameters
partPath=hdfs://gcgprod/data/work/hive/historical_trend_result/.hive-staging_hive_2021-09-01_04-34-04_254_8783620706620422928-1/-ext-10000/_temporary/0,
table=historical_trend_result, partSpec={appname=, year_month=}, replace=false, listBucketingEnabled=false, isAcid=false, hasFollowingStatsTask=false
org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Partition spec is incorrect. {appname=, year_month=})
    at org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:1662)
    at org.apache.hadoop.hive.ql.metadata.Hive.lambda$loadDynamicPartitions$4(Hive.java:1970)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: MetaException(message:Partition spec is incorrect. {appname=, year_month=})
    at org.apache.hadoop.hive.metastore.Warehouse.makePartName(Warehouse.java:329)
    at org.apache.hadoop.hive.metastore.Warehouse.makePartPath(Warehouse.java:312)
    at org.apache.hadoop.hive.ql.metadata.Hive.genPartPathFromTable(Hive.java:1751)
    at org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:1607)
I have specified the partition columns as the last columns of the DataFrame, so I expect it to treat the last two columns as the partition columns. I want to use the same routine for inserting into different tables, so I don't want to mention the partition columns explicitly.
Just to recap: you are using Spark to write data to a Hive table with dynamic partitions. My answer below is based on that; if my understanding is incorrect, please feel free to correct me in a comment.
While your table is dynamically partitioned (by appname and year_month), the Spark job doesn't know the partitioning fields of the destination, so you still have to tell it about the partition columns of the destination table.
Something like this should work:
import org.apache.spark.sql.SaveMode

data_df.repartition(1)
  .write
  .partitionBy("appname", "year_month")
  .mode(SaveMode.Append)
  .saveAsTable(Outputhive_table)
Make sure that you enable support for dynamic partitions by executing something like
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
Check out this post by Itai Yaffe; it may be handy: https://medium.com/nmc-techblog/spark-dynamic-partition-inserts-part-1-5b66a145974f
I think the problem is that some records have empty appname and year_month values. At least this is suggested by
Partition spec is incorrect. {appname=, year_month=}
Make sure the partition columns are never empty or null! Also note that the type of year_month is not consistent between the DataFrame (string) and the table schema (int). A sketch of that cleanup follows below.
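A minimal sketch of what that cleanup could look like before writing (column names taken from your schema; the exact validity rules are an assumption about your data):
import org.apache.spark.sql.functions.{col, trim}

// Drop rows whose partition values are null or blank, and align year_month
// with the table's int type. Adjust the rules to whatever is valid in your data.
val cleaned_df = data_df
  .filter(col("appname").isNotNull && trim(col("appname")) =!= "")
  .filter(col("year_month").isNotNull && trim(col("year_month")) =!= "")
  .withColumn("year_month", col("year_month").cast("int"))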

Loading data from glue to snowflake

I am trying to run an ETL job on Glue where I extract data from MongoDB into a Spark DataFrame and load it into Snowflake.
This is a sample of the Spark DataFrame's schema:
|-- login: struct (nullable = true)
| |-- login_attempts: integer (nullable = true)
| |-- last_attempt: timestamp (nullable = true)
|-- name: string (nullable = true)
|-- notifications: struct (nullable = true)
| |-- bot_review_queue: boolean (nullable = true)
| |-- bot_review_queue_web_push: boolean (nullable = true)
| |-- bot_review_queue_web_push_admin: boolean (nullable = true)
| |-- weekly_account_summary: struct (nullable = true)
| | |-- enabled: boolean (nullable = true)
| |-- weekly_summary: struct (nullable = true)
| | |-- enabled: boolean (nullable = true)
| | |-- day: integer (nullable = true)
| | |-- hour: integer (nullable = true)
| | |-- minute: integer (nullable = true)
|-- query: struct (nullable = true)
| |-- email_address: string (nullable = true)
I am trying to load the data into Snowflake as-is, with the struct columns as JSON payloads in Snowflake, but it throws the following error:
An error occurred while calling o81.collectToPython. com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast ARRAY into a StructType
I also tried casting the struct columns to string before loading, but it throws more or less the same error:
An error occurred while calling o106.save. com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a StructType
I would really appreciate some help with this.
Code below for casting and loading:
dynamic_frame = glueContext.create_dynamic_frame.from_options(connection_type="mongodb",
                                                              connection_options=read_mongo_options)
user_df = dynamic_frame.toDF()
user_df_cast = user_df.select(user_df.login.cast(StringType()), 'name', user_df.notifications.cast(StringType()))
datasinkusers = user_df_cast.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "users").mode("append").save()
If your users table in Snowflake has the following schema then casting is not required, as the StructType fields of a SparkSQL DataFrame will map to the VARIANT type in Snowflake automatically:
CREATE TABLE users (
login VARIANT
,name STRING
,notifications VARIANT
,query VARIANT
)
Just do the following; no transformations are required because the Snowflake Spark Connector understands the data types and will convert them to appropriate JSON representations on its own:
user_df = glueContext.create_dynamic_frame.from_options(
    connection_type="mongodb",
    connection_options=read_mongo_options
)

(
    user_df
    .toDF()
    .write
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(**sfOptions)
    .option("dbtable", "users")
    .mode("append")
    .save()
)
If you absolutely need to store the StructType fields as plain JSON strings, you'll need to explicitly transform them using the to_json SparkSQL function:
from pyspark.sql.functions import to_json

# Alias the generated columns so they keep the original field names.
user_df_cast = user_df.select(
    to_json(user_df.login).alias("login"),
    user_df.name,
    to_json(user_df.notifications).alias("notifications")
)
This will store the JSON as simple VARCHAR values, which will not let you leverage Snowflake's semi-structured data storage and querying capabilities directly without a PARSE_JSON step (inefficient).
Consider using the VARIANT approach shown above, which lets you query the fields directly:
SELECT
login:login_attempts
,login:last_attempt
,name
,notifications:weekly_summary.enabled
FROM users

How to add a new nullable String column in a DataFrame using Scala

There are probably at least 10 questions very similar to this one, but I still have not found a clear answer.
How can I add a nullable string column to a DataFrame using Scala? I was able to add a column with null values, but the DataType shows null:
val testDF = myDF.withColumn("newcolumn", when(col("UID") =!= "not", null).otherwise(null))
However, the schema shows
root
|-- UID: string (nullable = true)
|-- IsPartnerInd: string (nullable = true)
|-- newcolumn: null (nullable = true)
I want the new column to be a string: |-- newcolumn: string (nullable = true)
Please don't mark this as a duplicate unless it's really the same question and in Scala.
Just explicitly cast the null literal to StringType:
scala> val testDF = myDF.withColumn("newcolumn", when(col("UID") =!= "not", lit(null).cast(StringType)).otherwise(lit(null).cast(StringType)))
scala> testDF.printSchema
root
|-- UID: string (nullable = true)
|-- newcolumn: string (nullable = true)
Why do you want a column that is always null? There are several ways to do this; I would prefer the solution with typedLit:
myDF.withColumn("newcolumn", typedLit[String](null))
or for older Spark versions:
myDF.withColumn("newcolumn",lit(null).cast(StringType))

Spark-xml creating `_VALUE` column which conflicts with another column named _value

I am using Spark to process some data stored in an XML file.
I successfully loaded my data and printed the schema:
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag","elementTag")
.load(myPath+"/myfile.xml")
df.printSchema
This gives me a result that looks like this:
root
|-- _id: string (nullable = true)
|-- _type: string (nullable = true)
|-- creationDate: struct (nullable = true)
| |-- _VALUE: string (nullable = true)
| |-- _value: string (nullable = true)
|-- lastUpdateDate: struct (nullable = true)
| |-- _VALUE: string (nullable = true)
| |-- _value: string (nullable = true)
From this data, I want to extract only certain fields, which should be easy with a select. So I make the following request:
df.select("_id","creationDate._value","lastUpdateDate._value")
But I get the error:
org.apache.spark.sql.AnalysisException: Ambiguous reference to fields StructField(_VALUE,StringType,true), StructField(_value,StringType,true);
My problem is that Spark SQL is not case sensitive, my file contains both a _value and a _VALUE field, and I can't change my input file.
Is there a way to solve this problem with Spark?
Spark-xml creates a _VALUE column when an XML tag has no child elements, which causes the conflict with your other _value column.
You can change the default _VALUE name by adding the valueTag option while reading the XML:
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag","elementTag")
.option("valueTag", "anyName")
.load(myPath+"/myfile.xml")
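Once the value tag is renamed, the _value attribute column is no longer ambiguous, so a select such as the following should work (a sketch based on the schema above; "anyName" is just the placeholder chosen in the option):
import org.apache.spark.sql.functions.col

// After setting valueTag to "anyName", the tag's text content is exposed as
// creationDate.anyName, and the attribute stays available as creationDate._value.
val selected = df.select(
  col("_id"),
  col("creationDate._value").as("creationDate_value"),
  col("lastUpdateDate._value").as("lastUpdateDate_value")
)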
Hope this helps!

Change Data Types for Dataframe by Schema in Scala Spark

I have a DataFrame without a schema, where every column is stored as StringType, such as:
ID | LOG_IN_DATE | USER
1 | 2017-11-01 | Johns
Now I have created a schema as [("ID","double"),("LOG_IN_DATE","date"),("USER","string")] and I would like to apply it to the above DataFrame in Spark 2.0.2 with Scala 2.11.
I already tried:
schema.map(x => df.withColumn(x._1, col(x._1).cast(x._2)))
There's no error when running this, but afterwards when I check df.schema, nothing has changed.
Any idea how I could programmatically apply the schema to df? My friend told me I can use the foldLeft method, but I don't think that is a method on either the DataFrame or the RDD in Spark 2.0.2.
If you already have the list [("ID","double"),("LOG_IN_DATE","date"),("USER","string")], you can use select, casting each column to its type from the list.
Your dataframe
val df = Seq(("1", "2017-11-01", "Johns"), ("2", "2018-01-03", "jons2")).toDF("ID", "LOG_IN_DATE", "USER")
Your schema
val schema = List(("ID", "double"), ("LOG_IN_DATE", "date"), ("USER", "string"))
Cast all the columns to their types from the list:
val newColumns = schema.map(c => col(c._1).cast(c._2))
Select all the cast columns:
val newDF = df.select(newColumns:_*)
Print Schema
newDF.printSchema()
root
|-- ID: double (nullable = true)
|-- LOG_IN_DATE: date (nullable = true)
|-- USER: string (nullable = true)
Show Dataframe
newDF.show()
Output:
+---+-----------+-----+
|ID |LOG_IN_DATE|USER |
+---+-----------+-----+
|1.0|2017-11-01 |Johns|
|2.0|2018-01-03 |jons2|
+---+-----------+-----+
My friend told me I can use foldLeft method but I don't think this is a method in Spark 2.0.2 neither in df nor rdd
Yes, foldLeft is the way to go. foldLeft is a method on the Scala List that holds your schema, not on the DataFrame or RDD, which is why you didn't find it there.
This is the schema before using foldLeft
root
|-- ID: string (nullable = true)
|-- LOG_IN_DATE: string (nullable = true)
|-- USER: string (nullable = true)
Using foldLeft
val schema = List(("ID","double"),("LOG_IN_DATE","date"),("USER","string"))
import org.apache.spark.sql.functions._
schema.foldLeft(df) { case (tempdf, x) =>
  tempdf.withColumn(x._1, col(x._1).cast(x._2))
}.printSchema()
and this is the schema after foldLeft
root
|-- ID: double (nullable = true)
|-- LOG_IN_DATE: date (nullable = true)
|-- USER: string (nullable = true)
I hope the answer is helpful
Spark DataFrames are immutable: any transformation returns a new DataFrame, so you can't change the data types of an existing DataFrame's schema in place.
Below is the code to create a new DataFrame with the modified schema by casting the columns.
1. Create a DataFrame:
val df=Seq((1,"2017-11-01","Johns"),(2,"2018-01-03","Alice")).toDF("ID","LOG_IN_DATE","USER")
2. Register the DataFrame as a temp table:
df.registerTempTable("user")
3. Now create a new DataFrame by casting the column data types:
val new_df=spark.sql("""SELECT ID,TO_DATE(CAST(UNIX_TIMESTAMP(LOG_IN_DATE, 'yyyy-MM-dd') AS TIMESTAMP)) AS LOG_IN_DATE,USER from user""")
4. Display the schema:
new_df.printSchema
root
|-- ID: integer (nullable = false)
|-- LOG_IN_DATE: date (nullable = true)
|-- USER: string (nullable = true)
Actually, what you did:
schema.map(x => df.withColumn(x._1, col(x._1).cast(x._2)))
could work, but you need to define your DataFrame as a var and reassign it, like this (note that type is a reserved word in Scala, so the loop variable is renamed):
for ((name, dtype) <- schema) {
  df = df.withColumn(name, col(name).cast(dtype))
}
Also, you could try reading your data with a case class like this:
import java.sql.Date
import org.apache.spark.sql.Encoders
import spark.implicits._

case class MyClass(ID: Int, LOG_IN_DATE: Date, USER: String)
// Suppose you are reading from JSON; passing the case class schema lets the date column parse correctly
val ds = spark.read.schema(Encoders.product[MyClass].schema).json(path).as[MyClass]
Hope this helps!