PySpark: Create new column based on null values in other columns

I am working on a PySpark transformation to create a new column based on null values in other columns. Below is the sample input dataframe:
Input DataFrame
This is the expected output dataframe:
Output Dataframe

Hi, welcome to Stack Overflow. It is probably a good idea to read the question best practices.
To build a column like that, the easiest way is probably to build a helper column for each original column, holding your desired text whenever that column is null.
For example:
from pyspark.sql import functions as F

cols = ["A", "B", "C", "D"]
new_cols = ["new_col", "A_nulls", "B_nulls", "C_nulls", "D_nulls"]
df = source_df.withColumn("new_col", F.lit("Null Columns:"))
df = df.withColumn("A_nulls", F.when(F.col("A").isNull(), F.lit("A,")).otherwise(""))
df = df.withColumn("B_nulls", F.when(F.col("B").isNull(), F.lit("B,")).otherwise(""))
df = df.withColumn("C_nulls", F.when(F.col("C").isNull(), F.lit("C,")).otherwise(""))
df = df.withColumn("D_nulls", F.when(F.col("D").isNull(), F.lit("D,")).otherwise(""))
df = df.select(*cols, F.concat(*new_cols).alias("NewCol"))
If you want to remove the trailing comma, you can apply F.regexp_replace("NewCol", ",$", "") afterwards, which should trim it.
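As a more compact sketch of the same idea (assuming the source_df and column names from the example above), the per-column fragments can be built in a loop so the code scales to any number of columns:
from pyspark.sql import functions as F

cols = ["A", "B", "C", "D"]
# one fragment per column: the column name plus a comma when it is null, otherwise ""
fragments = [F.when(F.col(c).isNull(), F.lit(c + ",")).otherwise("") for c in cols]
result = source_df.select(
    *cols,
    F.regexp_replace(  # drop the trailing comma
        F.concat(F.lit("Null Columns:"), *fragments), ",$", ""
    ).alias("NewCol")
)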

Related

PySpark: Problem building logic to read data from multiple format files

I am facing an issue creating an empty dataframe with a defined list of columns. I'll try to explain the issue here.
I don't know how to create the empty data frame, or what the best way is to iterate over each file with multiple formats and merge the data into a single data frame.
list_of_columns = [a, b, c, d]
finalDF = spark.createDataFrame([], schema=list_of_columns)
for file in list_of_files:
    if format == '.csv':
        df1 = spark.read.csv(CSVFile)
        finalDF = df1.union(df1)
    elif format == '.parquet':
        df2 = spark.read.parquet(ParquetFile)
        finalDF = df2.union(df2)
finalDF.show()
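For reference, a minimal sketch of one way to wire this up, assuming all files share the same set of columns and the format can be told from the file extension (list_of_files, the column names, and the header/type choices below are placeholders, not taken from the question):
from functools import reduce
from pyspark.sql import types as T

# build an explicit schema so the empty starter frame has named, typed columns (all strings assumed)
list_of_columns = ["a", "b", "c", "d"]
schema = T.StructType([T.StructField(c, T.StringType()) for c in list_of_columns])
finalDF = spark.createDataFrame([], schema=schema)

# read each file with the reader that matches its extension
frames = []
for path in list_of_files:
    if path.endswith(".csv"):
        frames.append(spark.read.csv(path, header=True, schema=schema))
    elif path.endswith(".parquet"):
        frames.append(spark.read.parquet(path))

# fold every frame into the (initially empty) result; unionByName matches columns by name
finalDF = reduce(lambda acc, df: acc.unionByName(df), frames, finalDF)
finalDF.show()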

Creating pyspark dataframe from same entries in the list

I have two lists: one is col_name = ['col1','col2','col3'] and the other is col_value = ['val1', 'val2', 'val3']. I am trying to create a dataframe from the two lists, with col_name providing the columns. I need the output with 3 columns and 1 row (with the header taken from col_name).
Finding it difficult to get the solution for this. Please help.
Construct data directly, and use the createDataFrame method to create a dataframe.
col_name = ['col1', 'col2', 'col3']
col_value = ['val1', 'val2', 'val3']
data = [col_value]
df = spark.createDataFrame(data, col_name)
df.printSchema()
df.show(truncate=False)
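For reference, the show(truncate=False) call should then print a single row under the three headers:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|val1|val2|val3|
+----+----+----+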

Pyspark: Identify the arrayType column from the Struct and call udf to convert array to string

I am creating an accelerator that migrates data from a source to a destination. For example, I will pick the data up from an API and migrate it to CSV. I faced issues handling an arraytype column while the data was being converted to CSV. I used withColumn with the concat_ws method (i.e., df1 = df.withColumn('films', F.concat_ws(':', F.col("films"))), where films is the arraytype column) and it worked. Now I want this to happen dynamically: without specifying the column name, is there a way to pick the arraytype column names from the schema (struct) and then call the udf?
Thank you for your time!
You can get the type of the columns using df.schema. Depending on the type of the column you can apply concat_ws or not:
from pyspark.sql import functions as F, types as T

data = [["test1", "test2", [1, 2, 3], ["a", "b", "c"]]]
schema = ["col1", "col2", "arr1", "arr2"]
df = spark.createDataFrame(data, schema)

# flatten every ArrayType column with concat_ws and keep the remaining columns as-is
array_cols = [F.concat_ws(":", c.name).alias(c.name)
              for c in df.schema if isinstance(c.dataType, T.ArrayType)]
other_cols = [F.col(c.name)
              for c in df.schema if not isinstance(c.dataType, T.ArrayType)]
df = df.select(other_cols + array_cols)
Result:
+-----+-----+-----+-----+
| col1| col2| arr1| arr2|
+-----+-----+-----+-----+
|test1|test2|1:2:3|a:b:c|
+-----+-----+-----+-----+
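Since the stated goal is migrating the data to CSV, the flattened frame can then be written out directly; the output path here is just a placeholder:
df.write.mode("overwrite").csv("/path/to/output", header=True)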

Drop a list of columns from a single dataframe in Spark

I have a Dataframe resulting from a join of two Dataframes, df1 and df2, into df3. All the columns found in df2 are also in df1, but their contents differ. I'd like to remove from the join all the df1 columns whose names are in df2.columns. Would there be a way to do this without using a var?
Currently I've done this
var ret = df3
df2.columns.foreach(coln => ret = ret.drop(df2(coln)))
but what I really want is just a shortcut for
df3.drop(df1(df2.columns(1))).drop(df1(df2.columns(2)))....
without using a var.
Passing a list of columns is not an option; I don't know if that's because I'm using Spark 2.2.
EDIT:
Important note: I don't know the columns of df1 and df2 in advance.
This is possible to achieve while you are performing the join itself. Please try the below code
val resultDf=df1.alias("frstdf").join(broadcast(df2).alias("scndf"), $"frstdf.col1" === $"scndf.col1", "left_outer").selectExpr("scndf.col1","scndf.col2"...)//.selectExpr("scndf.*")
This would only contain the columns from the second data frame. Hope this helps!
A shortcut would be:
val ret = df2.columns.foldLeft(df3)((acc,coln) => acc.drop(df2(coln)))
I would suggest removing the columns before the join. Alternatively, select from df3 only the columns that come from df2:
val ret = df3.select(df2.columns.map(col):_*)

Spark Scala: select column names from another dataframe

There are two JSON sources; the first one has more columns and is always a superset of the second.
val df1 = spark.read.json(sqoopJson)
val df2 = spark.read.json(kafkaJson)
Except operation:
I'd like to apply the except operation on both df1 and df2, but df1 has 10 columns and df2 has only 8 columns.
If I manually drop the 2 extra columns from df1, then except works. But I have 50+ tables/JSON files and need to do the EXCEPT for all 50 sets of tables/JSON.
Question:
How do I select from df1 only the (8) columns that are available in df2 and create a new df3? df3 will then have the data from df1 restricted to those columns, matching the df2 columns.
For the question of how to select from df1 only the columns available in df2 and create a new df3:
//Get the 8 column names from df2
val columns = df2.schema.fieldNames.map(col(_))
//select only the columns from df2
val df3 = df1.select(columns :_*)
Hope this helps!