How to select columns in PySpark which do not contain strings

I had the problem of how to remove the string columns in PySpark, keeping only the numerical and timestamp ones.
This is how I did it.
I started with this schema:
full_log.printSchema()
root
|-- ProgramClassID: integer (nullable = true)
|-- CategoryID: integer (nullable = true)
|-- LogServiceID: integer (nullable = true)
|-- LogDate: timestamp (nullable = true)
|-- AudienceTargetAgeID: integer (nullable = true)
|-- AudienceTargetEthnicID: integer (nullable = true)
|-- ClosedCaptionID: integer (nullable = true)
|-- CountryOfOriginID: integer (nullable = true)
|-- DubDramaCreditID: integer (nullable = true)
|-- EthnicProgramID: integer (nullable = true)
|-- ProductionSourceID: integer (nullable = true)
|-- FilmClassificationID: integer (nullable = true)
|-- ExhibitionID: integer (nullable = true)
|-- Duration: string (nullable = true)
|-- EndTime: string (nullable = true)
|-- LogEntryDate: timestamp (nullable = true)
|-- ProductionNO: string (nullable = true)
|-- ProgramTitle: string (nullable = true)
|-- StartTime: string (nullable = true)
This gets the list of column names to keep (everything that is not a string):
no_string_columns = [types[0] for types in full_log.dtypes if types[1] != 'string']
Then perform the final selection:
full_log_no_strings = full_log.select([*no_string_columns])

You can also use the schema object of the dataframe:
from pyspark.sql.types import *
non_string_columns = [column.name for column in full_log.schema if column.dataType != StringType()]
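Both snippets above build a list of the column names to keep and pass it to select(). As a small alternative sketch (assuming the same full_log DataFrame from the question), you can also go the other way and drop the string columns:
# Collect the names of the string columns and drop them, keeping everything else
string_cols = [name for name, dtype in full_log.dtypes if dtype == 'string']
full_log_no_strings = full_log.drop(*string_cols)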

You can use the logic below for this use case.
Step 1: Find the columns whose datatype is string.
Step 2: Drop those columns.
Step 3: Apply your logic to the remaining columns.
Code snippet:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data2 = [("James", "", "Smith", "36636", "M", 3000),
         ("Michael", "Rose", "", "40288", "M", 4000),
         ("Robert", "", "Williams", "42114", "M", 4000),
         ("Maria", "Anne", "Jones", "39192", "F", 4000),
         ("Jen", "Mary", "Brown", "", "F", -1)]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])

df = spark.createDataFrame(data=data2, schema=schema)
df.printSchema()
df.show(truncate=False)

# Collect the names of all string columns, then drop them
columnList = [item[0] for item in df.dtypes if item[1].startswith('string')]
print(columnList)
df1 = df.drop(*columnList)
display(df1)  # display() is available in Databricks notebooks
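After the drop, only the non-string column remains; with the sample schema above, df1.printSchema() shows:
root
|-- salary: integer (nullable = true)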

Related

PySpark - When Otherwise - Condition should be a Column

I have a dataframe with the schema below:
root
|-- tasin: string (nullable = true)
|-- advertiser_id: decimal(38,10) (nullable = true)
|-- predicted_sp_sold_units: decimal(38,10) (nullable = true)
|-- predicted_sp_impressions: decimal(38,10) (nullable = true)
|-- predicted_sp_clicks: decimal(38,10) (nullable = true)
|-- predicted_sdc_sold_units: decimal(38,10) (nullable = true)
|-- predicted_sdc_impressions: decimal(38,10) (nullable = true)
|-- predicted_sdc_clicks: decimal(38,10) (nullable = true)
|-- predicted_sda_sold_units: decimal(38,10) (nullable = true)
|-- predicted_sda_impressions: decimal(38,10) (nullable = true)
|-- predicted_sda_clicks: decimal(38,10) (nullable = true)
|-- region_id: integer (nullable = true)
|-- marketplace_id: integer (nullable = true)
|-- dataset_date: date (nullable = true)
Now I am using the select statement below. I am checking for the presence of a column name: if it is present, select its value, otherwise fill with null. The dataframe is stored in the df variable.
scores_df1 = df.select(
col('marketplace_id'),
col('region_id'),
col('tasin'),
col('advertiser_id'),
col('predicted_sp_sold_units'),
col('predicted_sp_impressions'),
col('predicted_sp_clicks'),
col('predicted_sdc_sold_units'),
col('predicted_sdc_impressions'),
col('predicted_sdc_clicks'),
col('predicted_sda_sold_units'),
col('predicted_sda_impressions'),
col('predicted_sda_clicks'),
when('sdcr_score' in df.columns is True, col('sdcr_score')).otherwise(lit(None)).alias('sdcr_score'),
when('sdar_score' in df.columns is True, col('sdar_score')).otherwise(lit(None)).alias('sdar_score')
)
I am receiving the error <class 'TypeError'>: condition should be a Column
Please advise what is wrong.
The expression 'sdcr_score' in df.columns is True is evaluated by Python before anything is sent to Spark, and it returns True/False.
So what you are actually passing to Spark is: when(True, ...).
when() expects its first argument to be a Column that evaluates to true/false, not a Python bool.
You can wrap the argument with the lit() function, which passes a literal boolean Column to the when clause.
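For illustration, here is a minimal sketch of an alternative that handles the whole presence check on the Python side instead of inside when(); the col_or_null helper is made up for this example, and the column names are taken from the question:
from pyspark.sql.functions import col, lit

# Build col(name) only when the column actually exists in df;
# otherwise substitute a NULL literal under the same alias.
def col_or_null(df, name):
    return col(name) if name in df.columns else lit(None).alias(name)

scores_df1 = df.select(
    col('marketplace_id'),
    col('region_id'),
    col('tasin'),
    col_or_null(df, 'sdcr_score'),
    col_or_null(df, 'sdar_score')
)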

Pyspark create temp view from dataframe

I am trying to query a huge CSV through spark.sql.
I created a dataframe from the CSV, and the dataframe seems to be created correctly.
I can read the schema and perform select and filter operations.
I would like to create a temp view so I can run the same queries with SQL, which I am more comfortable with, but the temp view seems to be created on the CSV header only.
Where am I making the mistake?
Thanks
>>> df = spark.read.options(header=True,inferSchema=True,delimiter=";").csv("./elenco_dm_tutti_csv_formato_opendata_UltimaVersione.csv")
>>> df.printSchema()
root
|-- TIPO: integer (nullable = true)
|-- PROGRESSIVO_DM_ASS: integer (nullable = true)
|-- DATA_PRIMA_PUBBLICAZIONE: string (nullable = true)
|-- DM_RIFERIMENTO: integer (nullable = true)
|-- GRUPPO_DM_SIMILI: integer (nullable = true)
|-- ISCRIZIONE_REPERTORIO: string (nullable = true)
|-- INIZIO_VALIDITA: string (nullable = true)
|-- FINE_VALIDITA: string (nullable = true)
|-- FABBRICANTE_ASSEMBLATORE: string (nullable = true)
|-- CODICE_FISCALE: string (nullable = true)
|-- PARTITA_IVA_VATNUMBER: string (nullable = true)
|-- CODICE_CATALOGO_FABBR_ASS: string (nullable = true)
|-- DENOMINAZIONE_COMMERCIALE: string (nullable = true)
|-- CLASSIFICAZIONE_CND: string (nullable = true)
|-- DESCRIZIONE_CND: string (nullable = true)
|-- DATAFINE_COMMERCIO: string (nullable = true)
>>> df.count()
1653697
>>> df.createOrReplaceTempView("mask")
>>> spark.sql("select count(*) from mask")
DataFrame[count(1): bigint]
Spark operations like sql() do not execute anything by themselves; they only build a DataFrame. You need to call an action such as .show() or .collect() to get the results.
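For example, adding an action actually runs the query (the count matches df.count() from the question):
>>> spark.sql("select count(*) from mask").show()
+--------+
|count(1)|
+--------+
| 1653697|
+--------+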

How to update the schema of a Spark DataFrame (methods like Dataset.withColumn and Dataset.select don't work in my case)

My question is whether there are any approaches to update the schema of a DataFrame without explicitly calling SparkSession.createDataFrame(dataframe.rdd, newSchema).
Details are as follows.
I have an original Spark DataFrame with the schema below:
root
|-- column11: string (nullable = true)
|-- column12: string (nullable = true)
|-- column13: string (nullable = true)
|-- column14: string (nullable = true)
|-- column15: string (nullable = true)
|-- column16: string (nullable = true)
|-- column17: string (nullable = true)
|-- column18: string (nullable = true)
|-- column19: string (nullable = true)
I applied Dataset.mapPartitions to the original DataFrame and got a new DataFrame (the one returned by Dataset.mapPartitions).
The reason for using Dataset.mapPartitions rather than Dataset.map is better transformation speed.
In this new DataFrame, every row should have a schema like the one below:
root
|-- column21: string (nullable = true)
|-- column22: long (nullable = true)
|-- column23: string (nullable = true)
|-- column24: long (nullable = true)
|-- column25: struct (nullable = true)
| |-- column251: string (nullable = true)
| |-- column252: string (nullable = true)
| |-- column253: string (nullable = true)
| |-- column254: string (nullable = true)
| |-- column255: string (nullable = true)
| |-- column256: string (nullable = true)
So the schema of the new DataFrame should be the same as above.
However, the schema of the new DataFrame is not updated automatically. The output of Dataset.printSchema on the new DataFrame is still the original one:
root
|-- column11: string (nullable = true)
|-- column12: string (nullable = true)
|-- column13: string (nullable = true)
|-- column14: string (nullable = true)
|-- column15: string (nullable = true)
|-- column16: string (nullable = true)
|-- column17: string (nullable = true)
|-- column18: string (nullable = true)
|-- column19: string (nullable = true)
So, in order to get the correct (updated) schema, what I'm doing is using SparkSession.createDataFrame(newDataFrame.rdd, newSchema).
My concern here is that falling back to RDD (newDataFrame.rdd) will hurt the transformation speed because Spark Catalyst doesn't handle RDD as well as Dataset/DataFrame.
My question is whether there are any approaches to update the schema of the new DataFrame without explicitly calling SparkSession.createDataFrame(newDataFrame.rdd, newSchema).
Thanks a lot.
You can use RowEncoder to define the schema for newDataFrame.
See the following example.
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import spark.implicits._

val originalDF = spark.sparkContext.parallelize(List(("Tonny", "city1"), ("Rogger", "city2"), ("Michal", "city3"))).toDF("name", "city")
val r = scala.util.Random

// The encoder carries the schema that the new DataFrame should have
val encoderForNewDF = RowEncoder(StructType(Array(
  StructField("name", StringType),
  StructField("num", IntegerType),
  StructField("city", StringType)
)))

// Passing the encoder explicitly makes mapPartitions produce rows with the new schema
val newDF = originalDF.mapPartitions { partition =>
  partition.map { row =>
    val name = row.getAs[String]("name")
    val city = row.getAs[String]("city")
    val num = r.nextInt
    Row.fromSeq(Array[Any](name, num, city))
  }
}(encoderForNewDF)

newDF.printSchema()
newDF.printSchema()
root
|-- name: string (nullable = true)
|-- num: integer (nullable = true)
|-- city: string (nullable = true)
Row Encoder for spark: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-RowEncoder.html

Scala Spark setting schema duplicates columns

I have an issue when specifying the schema of my dataframe. Without setting the schema, printSchema() produces:
root
|-- Store: string (nullable = true)
|-- Date: string (nullable = true)
|-- IsHoliday: string (nullable = true)
|-- Dept: string (nullable = true)
|-- Weekly_Sales: string (nullable = true)
|-- Temperature: string (nullable = true)
|-- Fuel_Price: string (nullable = true)
|-- MarkDown1: string (nullable = true)
|-- MarkDown2: string (nullable = true)
|-- MarkDown3: string (nullable = true)
|-- MarkDown4: string (nullable = true)
|-- MarkDown5: string (nullable = true)
|-- CPI: string (nullable = true)
|-- Unemployment: string (nullable = true)
However, when I specify the schema with .schema(schema):
val dfr = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(schema)
my printSchema() produces:
root
|-- Store: integer (nullable = true)
|-- Date: date (nullable = true)
|-- IsHoliday: boolean (nullable = true)
|-- Dept: integer (nullable = true)
|-- Weekly_Sales: integer (nullable = true)
|-- Temperature: double (nullable = true)
|-- Fuel_Price: double (nullable = true)
|-- MarkDown1: double (nullable = true)
|-- MarkDown2: double (nullable = true)
|-- MarkDown3: double (nullable = true)
|-- MarkDown4: double (nullable = true)
|-- MarkDown5: double (nullable = true)
|-- CPI: double (nullable = true)
|-- Unemployment: double (nullable = true)
|-- Dept: integer (nullable = true)
|-- Weekly_Sales: integer (nullable = true)
|-- Temperature: double (nullable = true)
|-- Fuel_Price: double (nullable = true)
|-- MarkDown1: double (nullable = true)
|-- MarkDown2: double (nullable = true)
|-- MarkDown3: double (nullable = true)
|-- MarkDown4: double (nullable = true)
|-- MarkDown5: double (nullable = true)
|-- CPI: double (nullable = true)
|-- Unemployment: double (nullable = true)
The dataframe itself has all these duplicate columns, and I'm not sure why.
My code:
// Make custom schema
val schema = StructType(Array(
  StructField("Store", IntegerType, true),
  StructField("Date", DateType, true),
  StructField("IsHoliday", BooleanType, true),
  StructField("Dept", IntegerType, true),
  StructField("Weekly_Sales", IntegerType, true),
  StructField("Temperature", DoubleType, true),
  StructField("Fuel_Price", DoubleType, true),
  StructField("MarkDown1", DoubleType, true),
  StructField("MarkDown2", DoubleType, true),
  StructField("MarkDown3", DoubleType, true),
  StructField("MarkDown4", DoubleType, true),
  StructField("MarkDown5", DoubleType, true),
  StructField("CPI", DoubleType, true),
  StructField("Unemployment", DoubleType, true)))
val dfr = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(schema)
val train_df = dfr.load("/FileStore/tables/train.csv")
val features_df = dfr.load("/FileStore/tables/features.csv")
// Combine the train and features
val data = train_df.join(features_df, Seq("Store", "Date", "IsHoliday"), "left")
data.show(5)
data.printSchema()
It's working as expected. Your train_df and features_df each have the same columns as the schema (14 columns) after load().
The join condition Seq("Store", "Date", "IsHoliday") takes these 3 columns from both DFs (3 + 3 = 6 columns in total) and merges them into a single set of 3 columns. But the rest of the columns come from both train_df (the remaining 11 columns) and features_df (the remaining 11 columns).
Hence your printSchema shows 25 columns (3 + 11 + 11).

How to make matching schemas for two data frames in a join without hard coding every column

I have two data frames on which I perform a join, and sometimes I get the error below:
org.apache.spark.sql.AnalysisException: cannot resolve 'CASE WHEN (`IsAnnualReported_1` IS NOT NULL) THEN `IsAnnualReported_1` ELSE CAST(`IsAnnualReported` AS BOOLEAN) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;;
To overcome this, I have to manually cast to matching data types, like below, for every column whose data type differs:
when($"IsAnnualReported_1".isNotNull, $"IsAnnualReported_1").otherwise($"IsAnnualReported".cast(DataTypes.BooleanType)).as("IsAnnualReported"),
This is how I perform the join on the two data frames:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.functions.regexp_extract
val get_cus_val = spark.udf.register("get_cus_val", (filePath: String) => filePath.split("\\.")(3))
val get_cus_YearPartition = spark.udf.register("get_cus_YearPartition", (filePath: String) => filePath.split("\\.")(4))
val df = sqlContext.read.format("csv").option("header", "true").option("delimiter", "|").option("inferSchema","true").load("s3://trfsmallfffile/FinancialPeriod/MAIN")
val df1With_ = df.toDF(df.columns.map(_.replace(".", "_")): _*)
val column_to_keep = df1With_.columns.filter(v => (!v.contains("^") && !v.contains("!") && !v.contains("_c"))).toSeq
val df1result = df1With_.select(column_to_keep.head, column_to_keep.tail: _*)
val df1resultFinal=df1result.withColumn("DataPartition", get_cus_val(input_file_name))
val df1resultFinalWithYear=df1resultFinal.withColumn("PartitionYear", get_cus_YearPartition(input_file_name))
val df2 = sqlContext.read.format("csv").option("header", "true").option("delimiter", "|").option("inferSchema","true").load("s3://trfsmallfffile/FinancialPeriod/INCR")
val df2With_ = df2.toDF(df2.columns.map(_.replace(".", "_")): _*)
val df2column_to_keep = df2With_.columns.filter(v => (!v.contains("^") && !v.contains("!") && !v.contains("_c"))).toSeq
val df2result = df2With_.select(df2column_to_keep.head, df2column_to_keep.tail: _*)
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("FinancialPeriod_organizationId", "FinancialPeriod_periodId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2result.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
df1resultFinalWithYear.printSchema()
latestForEachKey.printSchema()
val dfMainOutput = df1resultFinalWithYear.join(latestForEachKey, Seq("FinancialPeriod_organizationId", "FinancialPeriod_periodId"), "outer")
.select($"FinancialPeriod_organizationId", $"FinancialPeriod_periodId",
when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition".cast(DataTypes.StringType)).as("DataPartition"),
when($"PartitionYear_1".isNotNull, $"PartitionYear_1").otherwise($"PartitionYear".cast(DataTypes.StringType)).as("PartitionYear"),
when($"FinancialPeriod_periodEndDate_1".isNotNull, $"FinancialPeriod_periodEndDate_1").otherwise($"FinancialPeriod_periodEndDate").as("FinancialPeriod_periodEndDate"),
when($"FinancialPeriod_periodStartDate_1".isNotNull, $"FinancialPeriod_periodStartDate_1").otherwise($"FinancialPeriod_periodStartDate").as("FinancialPeriod_periodStartDate"),
when($"FinancialPeriod_periodDuration_1".isNotNull, $"FinancialPeriod_periodDuration_1").otherwise($"FinancialPeriod_periodDuration").as("FinancialPeriod_periodDuration"),
when($"FinancialPeriod_nonStandardPeriod_1".isNotNull, $"FinancialPeriod_nonStandardPeriod_1").otherwise($"FinancialPeriod_nonStandardPeriod").as("FinancialPeriod_nonStandardPeriod"),
when($"FinancialPeriod_periodType_1".isNotNull, $"FinancialPeriod_periodType_1").otherwise($"FinancialPeriod_periodType").as("FinancialPeriod_periodType"),
when($"PeriodFiscalYear_1".isNotNull, $"PeriodFiscalYear_1").otherwise($"PeriodFiscalYear").as("PeriodFiscalYear"),
when($"PeriodFiscalEndMonth_1".isNotNull, $"PeriodFiscalEndMonth_1").otherwise($"PeriodFiscalEndMonth").as("PeriodFiscalEndMonth"),
when($"IsAnnualReported_1".isNotNull, $"IsAnnualReported_1").otherwise($"IsAnnualReported".cast(DataTypes.BooleanType)).as("IsAnnualReported"),
when($"IsTransitional_1".isNotNull, $"IsTransitional_1").otherwise($"IsTransitional".cast(DataTypes.StringType)).as("IsTransitional"),
when($"CumulativeType_1".isNotNull, $"CumulativeType_1").otherwise($"CumulativeType").as("CumulativeType"),
when($"CalendarizedPeriodEndDate_1".isNotNull, $"CalendarizedPeriodEndDate_1").otherwise($"CalendarizedPeriodEndDate").as("CalendarizedPeriodEndDate"),
when($"EarliestAnnouncementDateTime_1".isNotNull, $"EarliestAnnouncementDateTime_1").otherwise($"EarliestAnnouncementDateTime").as("EarliestAnnouncementDateTime"),
when($"EADUTCOffset_1".isNotNull, $"EADUTCOffset_1").otherwise($"EADUTCOffset").as("EADUTCOffset"),
when($"PeriodPermId_1".isNotNull, $"PeriodPermId_1").otherwise($"PeriodPermId").as("PeriodPermId"),
when($"PeriodPermId_objectTypeId_1".isNotNull, $"PeriodPermId_objectTypeId_1").otherwise($"PeriodPermId_objectTypeId").as("PeriodPermId_objectTypeId"),
when($"PeriodPermId_objectType_1".isNotNull, $"PeriodPermId_objectType_1").otherwise($"PeriodPermId_objectType").as("PeriodPermId_objectType"),
when($"CumulativeTypeId_1".isNotNull, $"CumulativeTypeId_1").otherwise($"CumulativeTypeId").as("CumulativeTypeId"),
when($"PeriodTypeId_1".isNotNull, $"PeriodTypeId_1").otherwise($"PeriodTypeId").as("PeriodTypeId"),
when($"PeriodFiscalEndMonthId_1".isNotNull, $"PeriodFiscalEndMonthId_1").otherwise($"PeriodFiscalEndMonthId").as("PeriodFiscalEndMonthId"),
when($"PeriodLengthUnitId_1".isNotNull, $"PeriodLengthUnitId_1").otherwise($"PeriodLengthUnitId").as("PeriodLengthUnitId"),
when($"FFAction_1".isNotNull, concat(col("FFAction_1"), lit("|!|"))).otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"))
.filter(!$"FFAction".contains("D"))
Now what I need is this: how can I create the second data frame with the schema of the first data frame, so that I never get any data type mismatch errors?
Here are the schemas of the first and second data frames:
root
|-- FinancialPeriod_organizationId: long (nullable = true)
|-- FinancialPeriod_periodId: integer (nullable = true)
|-- FinancialPeriod_periodEndDate: timestamp (nullable = true)
|-- FinancialPeriod_periodStartDate: timestamp (nullable = true)
|-- FinancialPeriod_periodDuration: string (nullable = true)
|-- FinancialPeriod_nonStandardPeriod: string (nullable = true)
|-- FinancialPeriod_periodType: string (nullable = true)
|-- PeriodFiscalYear: integer (nullable = true)
|-- PeriodFiscalEndMonth: integer (nullable = true)
|-- IsAnnualReported: boolean (nullable = true)
|-- IsTransitional: boolean (nullable = true)
|-- CumulativeType: string (nullable = true)
|-- CalendarizedPeriodEndDate: string (nullable = true)
|-- EarliestAnnouncementDateTime: timestamp (nullable = true)
|-- EADUTCOffset: string (nullable = true)
|-- PeriodPermId: string (nullable = true)
|-- PeriodPermId_objectTypeId: string (nullable = true)
|-- PeriodPermId_objectType: string (nullable = true)
|-- CumulativeTypeId: integer (nullable = true)
|-- PeriodTypeId: integer (nullable = true)
|-- PeriodFiscalEndMonthId: integer (nullable = true)
|-- PeriodLengthUnitId: integer (nullable = true)
|-- FFAction: string (nullable = true)
|-- DataPartition: string (nullable = true)
|-- PartitionYear: string (nullable = true)
root
|-- DataPartition_1: string (nullable = true)
|-- PartitionYear_1: integer (nullable = true)
|-- FinancialPeriod_organizationId: long (nullable = true)
|-- FinancialPeriod_periodId: integer (nullable = true)
|-- FinancialPeriod_periodEndDate_1: timestamp (nullable = true)
|-- FinancialPeriod_periodStartDate_1: timestamp (nullable = true)
|-- FinancialPeriod_periodDuration_1: string (nullable = true)
|-- FinancialPeriod_nonStandardPeriod_1: string (nullable = true)
|-- FinancialPeriod_periodType_1: string (nullable = true)
|-- PeriodFiscalYear_1: string (nullable = true)
|-- PeriodFiscalEndMonth_1: string (nullable = true)
|-- IsAnnualReported_1: string (nullable = true)
|-- IsTransitional_1: string (nullable = true)
|-- CumulativeType_1: string (nullable = true)
|-- CalendarizedPeriodEndDate_1: string (nullable = true)
|-- EarliestAnnouncementDateTime_1: string (nullable = true)
|-- EADUTCOffset_1: string (nullable = true)
|-- PeriodPermId_1: string (nullable = true)
|-- PeriodPermId_objectTypeId_1: string (nullable = true)
|-- PeriodPermId_objectType_1: string (nullable = true)
|-- CumulativeTypeId_1: string (nullable = true)
|-- PeriodTypeId_1: string (nullable = true)
|-- PeriodFiscalEndMonthId_1: string (nullable = true)
|-- PeriodLengthUnitId_1: string (nullable = true)
|-- FFAction_1: string (nullable = true)
You already have a working solution.
Here I am going to show you how you can avoid manually writing the cast for each column.
Let's say you have two dataframes (as you already have them), such as:
df1
root
|-- col1: integer (nullable = false)
|-- col2: string (nullable = true)
df2
root
|-- cl2: integer (nullable = false)
|-- cl1: integer (nullable = false)
Suppose you want to change the data types of df2 to match those of df1. As you said, you know the mapping between the columns of the two dataframes, so you have to create a Map of the column relationships:
val columnMaps = Map("col1" -> "cl1", "col2" -> "cl2")
When you have the map as above, you can compute the data type each column of df2 should be cast to, as below:
val schema1 = df1.schema
val toBeChangedDataTypes = df1.schema.map(x => if (columnMaps.keySet.contains(x.name)) (columnMaps(x.name), x.dataType) else (x.name, x.dataType)).toList
Then you can change the data types of the columns of df2 to match df1 by calling a recursive function:
val finalDF = castingFunction(toBeChangedDataTypes, df2)
where castingFunction is a recursive function defined as
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DataType

// Recursively cast one (columnName, dataType) pair at a time until the list is empty
def castingFunction(typeList: List[(String, DataType)], df: DataFrame): DataFrame = typeList match {
  case x :: y => castingFunction(y, df.withColumn(x._1, col(x._1).cast(x._2)))
  case Nil => df
}
You will see that finalDF will have the following schema:
root
|-- cl2: string (nullable = false)
|-- cl1: integer (nullable = false)
You can do the same for your dataframes.
I hope the answer is helpful.