Scala Spark: setting schema duplicates columns

I have an issue when specifying the schema of my DataFrame. Without setting the schema, printSchema() produces:
root
|-- Store: string (nullable = true)
|-- Date: string (nullable = true)
|-- IsHoliday: string (nullable = true)
|-- Dept: string (nullable = true)
|-- Weekly_Sales: string (nullable = true)
|-- Temperature: string (nullable = true)
|-- Fuel_Price: string (nullable = true)
|-- MarkDown1: string (nullable = true)
|-- MarkDown2: string (nullable = true)
|-- MarkDown3: string (nullable = true)
|-- MarkDown4: string (nullable = true)
|-- MarkDown5: string (nullable = true)
|-- CPI: string (nullable = true)
|-- Unemployment: string (nullable = true)
However, when I specify the schema with .schema(schema):
val dfr = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(schema)
printSchema() now produces:
root
|-- Store: integer (nullable = true)
|-- Date: date (nullable = true)
|-- IsHoliday: boolean (nullable = true)
|-- Dept: integer (nullable = true)
|-- Weekly_Sales: integer (nullable = true)
|-- Temperature: double (nullable = true)
|-- Fuel_Price: double (nullable = true)
|-- MarkDown1: double (nullable = true)
|-- MarkDown2: double (nullable = true)
|-- MarkDown3: double (nullable = true)
|-- MarkDown4: double (nullable = true)
|-- MarkDown5: double (nullable = true)
|-- CPI: double (nullable = true)
|-- Unemployment: double (nullable = true)
|-- Dept: integer (nullable = true)
|-- Weekly_Sales: integer (nullable = true)
|-- Temperature: double (nullable = true)
|-- Fuel_Price: double (nullable = true)
|-- MarkDown1: double (nullable = true)
|-- MarkDown2: double (nullable = true)
|-- MarkDown3: double (nullable = true)
|-- MarkDown4: double (nullable = true)
|-- MarkDown5: double (nullable = true)
|-- CPI: double (nullable = true)
|-- Unemployment: double (nullable = true)
The DataFrame itself has all these duplicate columns, and I'm not sure why.
My code:
// Make custom schema
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("Store", IntegerType, true),
  StructField("Date", DateType, true),
  StructField("IsHoliday", BooleanType, true),
  StructField("Dept", IntegerType, true),
  StructField("Weekly_Sales", IntegerType, true),
  StructField("Temperature", DoubleType, true),
  StructField("Fuel_Price", DoubleType, true),
  StructField("MarkDown1", DoubleType, true),
  StructField("MarkDown2", DoubleType, true),
  StructField("MarkDown3", DoubleType, true),
  StructField("MarkDown4", DoubleType, true),
  StructField("MarkDown5", DoubleType, true),
  StructField("CPI", DoubleType, true),
  StructField("Unemployment", DoubleType, true)))
val dfr = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(schema)
val train_df = dfr.load("/FileStore/tables/train.csv")
val features_df = dfr.load("/FileStore/tables/features.csv")
// Combine the train and features
val data = train_df.join(features_df, Seq("Store", "Date", "IsHoliday"), "left")
data.show(5)
data.printSchema()

It's working as expected: after load(), both train_df and features_df have the same 14 columns defined by the schema.
In the join, Seq("Store", "Date", "IsHoliday") takes those 3 columns from both DataFrames (3 + 3 = 6 columns), joins on them, and keeps a single set of the 3 key columns. The remaining columns come from both sides: 11 from train_df and 11 from features_df.
Hence printSchema() shows 25 columns (3 + 11 + 11).
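A minimal sketch (not from the original answer, using the names from the code above) to verify that arithmetic and list the duplicated names:
// Column counts after the join
println(train_df.columns.length)    // 14
println(features_df.columns.length) // 14
println(data.columns.length)        // 25 = 3 shared join keys + 11 + 11

// Every non-key column now appears twice
val duplicated = data.columns
  .groupBy(identity)
  .collect { case (name, occurrences) if occurrences.length > 1 => name }
println(duplicated.toSeq.sorted.mkString(", "))
If the duplicates are unwanted, one common option is to rename the non-key columns on one side (for example by adding a suffix) before the join, or to select only the columns you need afterwards.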


Not able to add data through withColumn using case statement in PySpark

My code is below; the goal is to extract the deal keys.
Schema of lt_online:
root
|-- FT/RT: string (nullable = true)
|-- Country: string (nullable = true)
|-- Charge_Type: string (nullable = true)
|-- Tariff_Loc: string (nullable = true)
|-- Charge_No: string (nullable = true)
|-- Status: string (nullable = true)
|-- Validity_from: string (nullable = true)
|-- Validity_to: string (nullable = true)
|-- Range_Basis: string (nullable = true)
|-- Limited_Parties: string (nullable = true)
|-- Charge_Detail: string (nullable = true)
|-- Freetime_Unit: string (nullable = true)
|-- Freetime: string (nullable = true)
|-- Count_Holidays: string (nullable = true)
|-- Majeure: string (nullable = true)
|-- Start_Event: string (nullable = true)
|-- Same/Next_Day: string (nullable = true)
|-- Next_Day_if_AFTER: string (nullable = true)
|-- Availability_Date: string (nullable = true)
|-- Route_Group: string (nullable = true)
|-- Route_Code: string (nullable = true)
|-- Origin: string (nullable = true)
|-- LoadZone: string (nullable = true)
|-- FDischZone: string (nullable = true)
|-- PODZone: string (nullable = true)
|-- FDestZone: string (nullable = true)
|-- Equipment_Group: string (nullable = true)
|-- Equipment_Type: string (nullable = true)
|-- Range_From: string (nullable = true)
|-- Range_To: void (nullable = true)
|-- Cargo_Type: string (nullable = true)
|-- Commodity: string (nullable = true)
|-- SC_Group: string (nullable = true)
|-- SC_Number: string (nullable = true)
|-- IMO: string (nullable = true)
|-- Shipper_Group: string (nullable = true)
|-- Cnee_Group: string (nullable = true)
|-- Direction: string (nullable = true)
|-- Service: string (nullable = true)
|-- Haulage: string (nullable = true)
|-- Transport_Type: string (nullable = true)
|-- Option1: string (nullable = true)
|-- Option2: string (nullable = true)
|-- 1st_of_Route_Group: string (nullable = true)
|-- 1st_of_LoadZone: string (nullable = true)
|-- 1st_of_FDischZone: string (nullable = true)
|-- 1st_of_PODZone: string (nullable = true)
|-- 1st_of_FDestZone: string (nullable = true)
|-- 1st_of_Equipment_Group: string (nullable = true)
|-- 1st_of_SC_Group: string (nullable = true)
|-- 1st_of_Shipper_Group: string (nullable = true)
|-- 1st_of_Cnee_Group: string (nullable = true)
PySpark code as below:
from pyspark.sql.functions import col, lit, when

df = lt_online.withColumn("dealkeys", lit('')).withColumn("dealAttributes", lit(''))

start = []
start_dict = {}
dealatt = ["Charge_No", "Status", "Validity_from", "Validity_to"]
dealkeys = ["Charge_Type", "Direction"]

for index, row in lt_online.toPandas().iterrows():
    start = []
    start_dict = {}
    key = row['Charge_No']
    for i in dealatt:
        #final = row[i]
        start_dict[i] = row[i]
    df_deal_att = df.withColumn('dealkeys', when(col('Charge_No') == key, str(start_dict)).otherwise(col('dealkeys')))
    for i in dealkeys:
        #key = row['Charge_No']
        final = {"keyname": i, "value": row[i], "description": ".."}
        start.append(final)
        #final_val= {"value" : row['Charge_Type']}
        #start.append(final_val)
    #df3=lt_online.withColumn("new_column",str(start))
    print(key, start_dict)
    df3 = df_deal_att.withColumn('dealAttributes', when(col('Charge_No') == key, str(start)).otherwise(col('dealAttributes')))
When I run the df3 DataFrame, the old data in dealAttributes and dealkeys goes blank and only the latest record is inserted.
Please see the screenshot.
Since the lt_online dataframe is large, I have selected only the required columns from it; the schema shown above is of that selection.
The problem arises because you are not changing df in place, but assigning the result to df_deal_att. This updates df_deal_att (and hence df3) only for the current row of the loop, because df itself never changes during the whole process. Using df_deal_att.show() inside the loop will help in understanding this.
Use the following code instead to get the desired output:
for index, row in lt_online.toPandas().iterrows():
    start = []
    start_dict = {}
    key = row['Charge_No']
    for i in dealatt:
        start_dict[i] = row[i]
    #ASSIGN TO df INSTEAD OF df_deal_att
    df = df.withColumn('dealkeys', when(col('Charge_No') == key, str(start_dict)).otherwise(col('dealkeys')))
    for i in dealkeys:
        final = {"keyname": i, "value": row[i], "description": ".."}
        start.append(final)
    #USE df and ASSIGN TO df INSTEAD OF USING df_deal_att AND ASSIGNING TO df3
    df = df.withColumn('dealAttributes', when(col('Charge_No') == key, str(start)).otherwise(col('dealAttributes')))
Assigning the result back to df itself (instead of using df_deal_att or df3) after adding the column value based on the condition solves the issue. The following image reflects the output achieved after using the above code.

How to get correlation matrix for Scala dataframe

I have a Scala DataFrame with numeric data:
df2_num.printSchema
root
|-- ot2_total_sum: decimal(38,18) (nullable = true)
|-- s42_3: decimal(38,0) (nullable = true)
|-- s109_5: decimal(38,0) (nullable = true)
|-- is_individual: decimal(38,0) (nullable = true)
|-- s118_5: decimal(38,0) (nullable = true)
|-- s46_3: decimal(38,0) (nullable = true)
|-- ot1_nds_10: decimal(38,18) (nullable = true)
|-- s45_3: decimal(38,0) (nullable = true)
|-- s10_3: decimal(38,0) (nullable = true)
|-- nb: decimal(38,0) (nullable = true)
|-- s80_5: decimal(38,0) (nullable = true)
|-- ot2_nds_10: decimal(38,18) (nullable = true)
|-- pr: decimal(38,0) (nullable = true)
|-- IP: integer (nullable = true)
|-- s70_5: decimal(38,0) (nullable = true)
|-- ot1_sum_without_nds: decimal(38,18) (nullable = true)
|-- s109_3: decimal(38,0) (nullable = true)
|-- s60_3: decimal(38,0) (nullable = true)
|-- s190_3: decimal(38,0) (nullable = true)
|-- ot3_total_sum: decimal(38,18) (nullable = true)
|-- s130_3: decimal(38,0) (nullable = true)
|-- region: integer (nullable = true)
|-- s170_3: decimal(38,0) (nullable = true)
|-- s20_3: decimal(38,0) (nullable = true)
|-- s90_5: decimal(38,0) (nullable = true)
|-- ot2_nds_20: decimal(38,18) (nullable = true)
|-- s70_3: decimal(38,0) (nullable = true)
|-- ot1_nds_0: decimal(38,18) (nullable = true)
|-- s200_3: decimal(38,0) (nullable = true)
|-- ot2_sum_without_nds: decimal(38,18) (nullable = true)
|-- ot1_nds_20: decimal(38,18) (nullable = true)
|-- s120_3: decimal(38,0) (nullable = true)
|-- s150_3: decimal(38,0) (nullable = true)
|-- s40_3: decimal(38,0) (nullable = true)
|-- s10_5: decimal(38,0) (nullable = true)
|-- nalog: decimal(38,0) (nullable = true)
|-- ot1_total_sum: decimal(38,18) (nullable = true)
I need to get a correlation matrix for all columns of this dataframe.
I've tried to use org.apache.spark.mllib.stat.Statistics.corr. It requires RDD data, so I've converted my dataframe to an RDD:
val df2_num_rdd = df2_num.rdd
Then I try to use Statistics.corr and get an error:
val correlMatrix = Statistics.corr(df2_num_rdd , "pearson")
<console>:82: error: overloaded method value corr with alternatives:
(x: org.apache.spark.api.java.JavaRDD[java.lang.Double],y: org.apache.spark.api.java.JavaRDD[java.lang.Double])scala.Double <and>
(x: org.apache.spark.rdd.RDD[scala.Double],y: org.apache.spark.rdd.RDD[scala.Double])scala.Double <and>
(X: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector],method: String)org.apache.spark.mllib.linalg.Matrix
cannot be applied to (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row], String)
val correlMatrix = Statistics.corr(df2_num_rdd , "pearson")
So how do I need to prepare my data for Statistics.corr?
Assuming you're running a relatively recent version of Spark, I suggest using org.apache.spark.ml.stat.Correlation.corr instead.
First, you have to assemble the columns for which you want to compute correlation, and then you can get correlations as a dataframe. From here, you can fetch the first row and transform it to whatever suits your needs.
Here is an example:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.DataFrame
val assembled: DataFrame = new VectorAssembler()
  .setInputCols(df2_num.columns)
  .setOutputCol("correlations")
  .transform(df2_num)

val correlations: DataFrame =
  Correlation.corr(assembled, column = "correlations", method = "pearson")
Here are some useful links for guides related to this approach:
Spark MLlib Guide : Correlation
Spark MLlib Guide : VectorAssembler
.getAs[DenseMatrix] in correlations.first.getAs[DenseMatrix] is throwing an error.
@H.Leger - How would you convert the final result to a proper matrix of this format?
Column   c1     c2     c3
c1       1      0.97   0.92
c2       0.97   1      0.94
c3       0.92   0.94   1
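To address the follow-up comments, here is a minimal sketch (not part of the original answer) of one way to pull the matrix out and label it with the column names. The result column holds an org.apache.spark.ml.linalg.Matrix, and one common cause of getAs[DenseMatrix] failing is importing the mllib DenseMatrix instead of the ml one:
import org.apache.spark.ml.linalg.Matrix

// The correlation matrix sits in the first (and only) row/column of the result
val corrMatrix: Matrix = correlations.first.getAs[Matrix](0)

// Print the matrix with row/column labels taken from the source dataframe
val names = df2_num.columns
println(("Column" +: names.toSeq).mkString("\t"))
corrMatrix.rowIter.zip(names.iterator).foreach { case (rowVec, name) =>
  println((name +: rowVec.toArray.map(v => f"$v%.2f").toSeq).mkString("\t"))
}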

How to select columns in PySpark which do not contain strings

I had the problem of how to remove the string columns in PySpark, keeping only the numerical ones and timestamps.
This is how I did it.
I had this:
full_log.printSchema()
root
|-- ProgramClassID: integer (nullable = true)
|-- CategoryID: integer (nullable = true)
|-- LogServiceID: integer (nullable = true)
|-- LogDate: timestamp (nullable = true)
|-- AudienceTargetAgeID: integer (nullable = true)
|-- AudienceTargetEthnicID: integer (nullable = true)
|-- ClosedCaptionID: integer (nullable = true)
|-- CountryOfOriginID: integer (nullable = true)
|-- DubDramaCreditID: integer (nullable = true)
|-- EthnicProgramID: integer (nullable = true)
|-- ProductionSourceID: integer (nullable = true)
|-- FilmClassificationID: integer (nullable = true)
|-- ExhibitionID: integer (nullable = true)
|-- Duration: string (nullable = true)
|-- EndTime: string (nullable = true)
|-- LogEntryDate: timestamp (nullable = true)
|-- ProductionNO: string (nullable = true)
|-- ProgramTitle: string (nullable = true)
|-- StartTime: string (nullable = true)
This gets the list of column names to keep:
no_string_columns = [types[0] for types in full_log.dtypes if types[1] != 'string']
Perform the final selection
full_log_no_strings = full_log.select([*no_string_columns])
You can also use the schema object of the dataframe:
from pyspark.sql.types import *
string_columns = [column.name for column in full_log.schema if column.dataType != StringType()]
You can use the logic below for this use case.
Step 1: Find the columns whose datatype is string.
Step 2: Remove that list of columns.
Step 3: Apply your logic.
Code snippet:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data2 = [("James", "", "Smith", "36636", "M", 3000),
         ("Michael", "Rose", "", "40288", "M", 4000),
         ("Robert", "", "Williams", "42114", "M", 4000),
         ("Maria", "Anne", "Jones", "39192", "F", 4000),
         ("Jen", "Mary", "Brown", "", "F", -1)]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])

df = spark.createDataFrame(data=data2, schema=schema)
df.printSchema()
df.show(truncate=False)

columnList = [item[0] for item in df.dtypes if item[1].startswith('string')]
print(columnList)
df1 = df.drop(*columnList)
display(df1)

How to update the schema of a Spark DataFrame (methods like Dataset.withColumn and Dataset.select don't work in my case)

My question is if there are any approaches to update the schema of a DataFrame without explicitly calling SparkSession.createDataFrame(dataframe.rdd, newSchema).
Details are as follows.
I have an original Spark DataFrame with schema below:
root
|-- column11: string (nullable = true)
|-- column12: string (nullable = true)
|-- column13: string (nullable = true)
|-- column14: string (nullable = true)
|-- column15: string (nullable = true)
|-- column16: string (nullable = true)
|-- column17: string (nullable = true)
|-- column18: string (nullable = true)
|-- column19: string (nullable = true)
I applied Dataset.mapPartitions on the original DataFrame and got a new DataFrame (returned by Dataset.mapPartitions).
The reason for using Dataset.mapPartitions but not Dataset.map is better transformation speed.
In this new DataFrame, every row should have a schema like below:
root
|-- column21: string (nullable = true)
|-- column22: long (nullable = true)
|-- column23: string (nullable = true)
|-- column24: long (nullable = true)
|-- column25: struct (nullable = true)
| |-- column251: string (nullable = true)
| |-- column252: string (nullable = true)
| |-- column253: string (nullable = true)
| |-- column254: string (nullable = true)
| |-- column255: string (nullable = true)
| |-- column256: string (nullable = true)
So the schema of the new DataFrame should be the same as the above.
However, the schema of the new DataFrame won't be updated automatically. The output of applying the Dataset.printSchema method on the new DataFrame is still the original:
root
|-- column11: string (nullable = true)
|-- column12: string (nullable = true)
|-- column13: string (nullable = true)
|-- column14: string (nullable = true)
|-- column15: string (nullable = true)
|-- column16: string (nullable = true)
|-- column17: string (nullable = true)
|-- column18: string (nullable = true)
|-- column19: string (nullable = true)
So, in order to get the correct (updated) schema, what I'm doing is using SparkSession.createDataFrame(newDataFrame.rdd, newSchema).
My concern here is that falling back to RDD (newDataFrame.rdd) will hurt the transformation speed because Spark Catalyst doesn't handle RDD as well as Dataset/DataFrame.
My question is if there are any approaches to update the schema of the new DataFrame without explicitly calling SparkSession.createDataFrame(newDataFrame.rdd, newSchema).
Thanks a lot.
You can use RowEncoder to define the schema for newDataFrame.
See the following example.
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import spark.implicits._

val originalDF = spark.sparkContext.parallelize(List(("Tonny", "city1"), ("Rogger", "city2"), ("Michal", "city3"))).toDF("name", "city")
val r = scala.util.Random

val encoderForNewDF = RowEncoder(StructType(Array(
  StructField("name", StringType),
  StructField("num", IntegerType),
  StructField("city", StringType)
)))

val newDF = originalDF.mapPartitions { partition =>
  partition.map { row =>
    val name = row.getAs[String]("name")
    val city = row.getAs[String]("city")
    val num = r.nextInt
    Row.fromSeq(Array[Any](name, num, city))
  }
} (encoderForNewDF)
newDF.printSchema()
root
|-- name: string (nullable = true)
|-- num: integer (nullable = true)
|-- city: string (nullable = true)
Row Encoder for spark: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-RowEncoder.html
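If the shape of the new rows is fixed, an alternative sketch (not from the original answer; the Person case class name is purely illustrative) is to map to a case class, so that the encoder, and therefore the new schema, is derived implicitly via spark.implicits._:
// Hypothetical case class describing the new row shape
case class Person(name: String, num: Int, city: String)

val newDF2 = originalDF.mapPartitions { partition =>
  partition.map { row =>
    Person(row.getAs[String]("name"), r.nextInt, row.getAs[String]("city"))
  }
}.toDF()

newDF2.printSchema()  // name: string, num: integer, city: string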

How to make matching schemas for two data frames in a join without hard coding every column

I have two data frames on which I perform a join, and sometimes I get the error below:
org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`IsAnnualReported_1` IS NOT NULL) THEN `IsAnnualReported_1` ELSE CAST(`IsAnnualReported` AS BOOLEAN) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;;
Now, to overcome this, I have to manually cast to matching data types, as below, for every column whose data types mismatch.
when($"IsAnnualReported_1".isNotNull, $"IsAnnualReported_1").otherwise($"IsAnnualReported".cast(DataTypes.BooleanType)).as("IsAnnualReported"),
This is how I perform the join on the two data frames.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.functions.regexp_extract
val get_cus_val = spark.udf.register("get_cus_val", (filePath: String) => filePath.split("\\.")(3))
val get_cus_YearPartition = spark.udf.register("get_cus_YearPartition", (filePath: String) => filePath.split("\\.")(4))
val df = sqlContext.read.format("csv").option("header", "true").option("delimiter", "|").option("inferSchema","true").load("s3://trfsmallfffile/FinancialPeriod/MAIN")
val df1With_ = df.toDF(df.columns.map(_.replace(".", "_")): _*)
val column_to_keep = df1With_.columns.filter(v => (!v.contains("^") && !v.contains("!") && !v.contains("_c"))).toSeq
val df1result = df1With_.select(column_to_keep.head, column_to_keep.tail: _*)
val df1resultFinal=df1result.withColumn("DataPartition", get_cus_val(input_file_name))
val df1resultFinalWithYear=df1resultFinal.withColumn("PartitionYear", get_cus_YearPartition(input_file_name))
val df2 = sqlContext.read.format("csv").option("header", "true").option("delimiter", "|").option("inferSchema","true").load("s3://trfsmallfffile/FinancialPeriod/INCR")
val df2With_ = df2.toDF(df2.columns.map(_.replace(".", "_")): _*)
val df2column_to_keep = df2With_.columns.filter(v => (!v.contains("^") && !v.contains("!") && !v.contains("_c"))).toSeq
val df2result = df2With_.select(df2column_to_keep.head, df2column_to_keep.tail: _*)
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("FinancialPeriod_organizationId", "FinancialPeriod_periodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
df1resultFinalWithYear.printSchema()
latestForEachKey.printSchema()
val dfMainOutput = df1resultFinalWithYear.join(latestForEachKey, Seq("FinancialPeriod_organizationId", "FinancialPeriod_periodId"), "outer")
.select($"FinancialPeriod_organizationId", $"FinancialPeriod_periodId",
when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition".cast(DataTypes.StringType)).as("DataPartition"),
when($"PartitionYear_1".isNotNull, $"PartitionYear_1").otherwise($"PartitionYear".cast(DataTypes.StringType)).as("PartitionYear"),
when($"FinancialPeriod_periodEndDate_1".isNotNull, $"FinancialPeriod_periodEndDate_1").otherwise($"FinancialPeriod_periodEndDate").as("FinancialPeriod_periodEndDate"),
when($"FinancialPeriod_periodStartDate_1".isNotNull, $"FinancialPeriod_periodStartDate_1").otherwise($"FinancialPeriod_periodStartDate").as("FinancialPeriod_periodStartDate"),
when($"FinancialPeriod_periodDuration_1".isNotNull, $"FinancialPeriod_periodDuration_1").otherwise($"FinancialPeriod_periodDuration").as("FinancialPeriod_periodDuration"),
when($"FinancialPeriod_nonStandardPeriod_1".isNotNull, $"FinancialPeriod_nonStandardPeriod_1").otherwise($"FinancialPeriod_nonStandardPeriod").as("FinancialPeriod_nonStandardPeriod"),
when($"FinancialPeriod_periodType_1".isNotNull, $"FinancialPeriod_periodType_1").otherwise($"FinancialPeriod_periodType").as("FinancialPeriod_periodType"),
when($"PeriodFiscalYear_1".isNotNull, $"PeriodFiscalYear_1").otherwise($"PeriodFiscalYear").as("PeriodFiscalYear"),
when($"PeriodFiscalEndMonth_1".isNotNull, $"PeriodFiscalEndMonth_1").otherwise($"PeriodFiscalEndMonth").as("PeriodFiscalEndMonth"),
when($"IsAnnualReported_1".isNotNull, $"IsAnnualReported_1").otherwise($"IsAnnualReported".cast(DataTypes.BooleanType)).as("IsAnnualReported"),
when($"IsTransitional_1".isNotNull, $"IsTransitional_1").otherwise($"IsTransitional".cast(DataTypes.StringType)).as("IsTransitional"),
when($"CumulativeType_1".isNotNull, $"CumulativeType_1").otherwise($"CumulativeType").as("CumulativeType"),
when($"CalendarizedPeriodEndDate_1".isNotNull, $"CalendarizedPeriodEndDate_1").otherwise($"CalendarizedPeriodEndDate").as("CalendarizedPeriodEndDate"),
when($"EarliestAnnouncementDateTime_1".isNotNull, $"EarliestAnnouncementDateTime_1").otherwise($"EarliestAnnouncementDateTime").as("EarliestAnnouncementDateTime"),
when($"EADUTCOffset_1".isNotNull, $"EADUTCOffset_1").otherwise($"EADUTCOffset").as("EADUTCOffset"),
when($"PeriodPermId_1".isNotNull, $"PeriodPermId_1").otherwise($"PeriodPermId").as("PeriodPermId"),
when($"PeriodPermId_objectTypeId_1".isNotNull, $"PeriodPermId_objectTypeId_1").otherwise($"PeriodPermId_objectTypeId").as("PeriodPermId_objectTypeId"),
when($"PeriodPermId_objectType_1".isNotNull, $"PeriodPermId_objectType_1").otherwise($"PeriodPermId_objectType").as("PeriodPermId_objectType"),
when($"CumulativeTypeId_1".isNotNull, $"CumulativeTypeId_1").otherwise($"CumulativeTypeId").as("CumulativeTypeId"),
when($"PeriodTypeId_1".isNotNull, $"PeriodTypeId_1").otherwise($"PeriodTypeId").as("PeriodTypeId"),
when($"PeriodFiscalEndMonthId_1".isNotNull, $"PeriodFiscalEndMonthId_1").otherwise($"PeriodFiscalEndMonthId").as("PeriodFiscalEndMonthId"),
when($"PeriodLengthUnitId_1".isNotNull, $"PeriodLengthUnitId_1").otherwise($"PeriodLengthUnitId").as("PeriodLengthUnitId"),
when($"FFAction_1".isNotNull, concat(col("FFAction_1"), lit("|!|"))).otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
.filter(!$"FFAction".contains("D"))
Now what I need is: how can I create the second data frame with the schema of the first data frame, so that I never get errors like the data type mismatch above?
Here are the schemas of the first and second data frames:
root
|-- FinancialPeriod_organizationId: long (nullable = true)
|-- FinancialPeriod_periodId: integer (nullable = true)
|-- FinancialPeriod_periodEndDate: timestamp (nullable = true)
|-- FinancialPeriod_periodStartDate: timestamp (nullable = true)
|-- FinancialPeriod_periodDuration: string (nullable = true)
|-- FinancialPeriod_nonStandardPeriod: string (nullable = true)
|-- FinancialPeriod_periodType: string (nullable = true)
|-- PeriodFiscalYear: integer (nullable = true)
|-- PeriodFiscalEndMonth: integer (nullable = true)
|-- IsAnnualReported: boolean (nullable = true)
|-- IsTransitional: boolean (nullable = true)
|-- CumulativeType: string (nullable = true)
|-- CalendarizedPeriodEndDate: string (nullable = true)
|-- EarliestAnnouncementDateTime: timestamp (nullable = true)
|-- EADUTCOffset: string (nullable = true)
|-- PeriodPermId: string (nullable = true)
|-- PeriodPermId_objectTypeId: string (nullable = true)
|-- PeriodPermId_objectType: string (nullable = true)
|-- CumulativeTypeId: integer (nullable = true)
|-- PeriodTypeId: integer (nullable = true)
|-- PeriodFiscalEndMonthId: integer (nullable = true)
|-- PeriodLengthUnitId: integer (nullable = true)
|-- FFAction: string (nullable = true)
|-- DataPartition: string (nullable = true)
|-- PartitionYear: string (nullable = true)
root
|-- DataPartition_1: string (nullable = true)
|-- PartitionYear_1: integer (nullable = true)
|-- FinancialPeriod_organizationId: long (nullable = true)
|-- FinancialPeriod_periodId: integer (nullable = true)
|-- FinancialPeriod_periodEndDate_1: timestamp (nullable = true)
|-- FinancialPeriod_periodStartDate_1: timestamp (nullable = true)
|-- FinancialPeriod_periodDuration_1: string (nullable = true)
|-- FinancialPeriod_nonStandardPeriod_1: string (nullable = true)
|-- FinancialPeriod_periodType_1: string (nullable = true)
|-- PeriodFiscalYear_1: string (nullable = true)
|-- PeriodFiscalEndMonth_1: string (nullable = true)
|-- IsAnnualReported_1: string (nullable = true)
|-- IsTransitional_1: string (nullable = true)
|-- CumulativeType_1: string (nullable = true)
|-- CalendarizedPeriodEndDate_1: string (nullable = true)
|-- EarliestAnnouncementDateTime_1: string (nullable = true)
|-- EADUTCOffset_1: string (nullable = true)
|-- PeriodPermId_1: string (nullable = true)
|-- PeriodPermId_objectTypeId_1: string (nullable = true)
|-- PeriodPermId_objectType_1: string (nullable = true)
|-- CumulativeTypeId_1: string (nullable = true)
|-- PeriodTypeId_1: string (nullable = true)
|-- PeriodFiscalEndMonthId_1: string (nullable = true)
|-- PeriodLengthUnitId_1: string (nullable = true)
|-- FFAction_1: string (nullable = true)
You already have a good solution.
Here I am going to show you how you can avoid manually writing out each column for typecasting.
Let's say you have two dataframes (as you already do) such as:
df1
root
|-- col1: integer (nullable = false)
|-- col2: string (nullable = true)
df2
root
|-- cl2: integer (nullable = false)
|-- cl1: integer (nullable = false)
Suppose you want to change the data types of df2 to those of df1, and, as you said, you know the mapping between the columns of both dataframes. You have to create a Map of the relationship between the columns:
val columnMaps = Map("col1" -> "cl1", "col2"->"cl2")
Once you have the map as above, you can work out the data type each column of df2 should be cast to, as below:
val schema1 = df1.schema
val toBeChangedDataTypes =df1.schema.map(x => if(columnMaps.keySet.contains(x.name)) (columnMaps(x.name), x.dataType) else (x.name, x.dataType)).toList
Then you can change the data types of the columns of df2 to match df1 by calling a recursive function:
val finalDF = castingFunction(toBeChangedDataTypes, df2)
where castingFunction is a recursive function defined as:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DataType

def castingFunction(typeList: List[Tuple2[String, DataType]], df: DataFrame): DataFrame = typeList match {
  case x :: y => castingFunction(y, df.withColumn(x._1, col(x._1).cast(x._2)))
  case Nil => df
}
You will see that finalDF has the following schema:
root
|-- cl2: string (nullable = false)
|-- cl1: integer (nullable = false)
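For reference, a minimal sketch of the same cast written with a foldLeft instead of explicit recursion (using the names defined above):
val finalDF2 = toBeChangedDataTypes.foldLeft(df2) { case (df, (name, dataType)) =>
  df.withColumn(name, col(name).cast(dataType))
}
Both versions produce the same result; foldLeft simply threads the accumulated DataFrame through each cast without an explicit recursive call.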
You can do the same for your dataframes.
I hope the answer is helpful.