drop all df2.columns from another df (pyspark.sql.dataframe.DataFrame specific) - pyspark

I have a large DF (pyspark.sql.dataframe.DataFrame) that is the result of multiple joins, plus new columns created using a combination of inputs from different DFs, including DF2.
I want to drop all DF2 columns from DF after I'm done with the join/creating new columns based on DF2 input.
drop() doesn't accept a list, only a string or a Column.
I know that df.drop("col1", "col2", "coln") will work, but I'd prefer not to crowd the code (if I can) by listing those 20 columns.
Is there a better way of doing this in pyspark dataframe specifically?

# df2.columns is a plain Python list of column names;
# unpacking it with * passes each name as a separate argument to drop().
drop_cols = df2.columns
df = df.drop(*drop_cols)
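This works because DataFrame.drop takes a variable number of column names, so the list from df2.columns can simply be unpacked with *. A minimal, self-contained sketch of the pattern (the data and column names are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs, just to show the shape of the pattern.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df2 = spark.createDataFrame([(1, "x", 10), (2, "y", 20)], ["id", "extra1", "extra2"])

joined = df.join(df2, on="id", how="left")
# ... derive new columns from df2 here ...

result = joined.drop(*df2.columns)   # drops id, extra1 and extra2
# To keep shared join keys, exclude them first:
# result = joined.drop(*[c for c in df2.columns if c not in df.columns])
result.show()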

Related

PySpark: How to use select in a list comprehension but not truncate columns?

I want to trim the white space from my columns like so:
df = df.select( [trim(f"`{col}`").alias(col) for col in string_cols])
The problem is that it removes all my other columns from my dataframe. How can I do column-specific operations without throwing away the other columns?
I would like to retain the list comprehension capability.
I'm writing this on my phone, so let me know if it doesn't work. Have you tried:
df = df.select('*',*[trim(f"{col}").alias(col) for col in string_cols])
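If selecting '*' next to the re-aliased columns leaves you with duplicate column names, another way to keep the list comprehension is to rebuild the full column list and trim only the string columns. A sketch, assuming string_cols is a Python list of the string column names and keeping the question's backticks to guard names with special characters:

from pyspark.sql import functions as F

df = df.select(
    [F.trim(F.col(f"`{c}`")).alias(c) if c in string_cols else F.col(f"`{c}`")
     for c in df.columns]
)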

Update multiple column values of a dataframe unconditionally with different variables

I have a dataframe with some 10 columns. I have selected 4 of these 10 columns and cleaned their values (by calling an external API and using its response). I would now like to create a new dataframe (as the old one cannot be updated) in which these 4 columns hold their cleaned values (as returned by the API) while the other 6 stay as is.
I tried exploring .na.replace and .withColumn, but they both work on some condition for the columns.
val newdf = df.withColumn("col1", when(col("col1") === "XYZ", cleanedcol1)
.otherwise(col("col1")));
And
val newdf = df.na.replace("col1", Map("col1" -> cleanedcol1))
The first snippet matches col1's value against "XYZ" and only then replaces it; I want an unconditional change.
The second one actually looks for the string "col1" in the col1 column and hence does not replace anything.
What is the optimum approach to achieve this? The source of the df is Kafka and hence traffic would be fast.
You can make an unconditional change with withColumn; just write
val newdf = df.withColumn("col1", newColumnValue)
Where newColumnValue is the new value you want to set for the column.
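Applied to the four cleaned columns, the same call just repeats. A PySpark sketch of the idea (hypothetical data; F.trim stands in for whatever Column expression carries the API-cleaned value, e.g. a UDF wrapping the API call):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical frame; in the real job df comes from Kafka.
df = spark.createDataFrame([(" a ", " b ", 1)], ["col1", "col2", "other"])

new_df = (
    df.withColumn("col1", F.trim(F.col("col1")))   # unconditional replacement
      .withColumn("col2", F.trim(F.col("col2")))   # repeat for each cleaned column
)
new_df.show()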

Spark: group only part of the rows in a DataFrame

From a given DataFrame, I'd like to group only a few rows together and keep the other rows in the same dataframe.
My current solution is:
val aggregated = mydf.filter(col("check").equalTo("do_aggregate")).groupBy(...).agg()
val finalDF = aggregated.unionByName(mydf.filter(col("check").notEqual("do_aggregate")))
However, I'd like to find a more elegant and performant way.
Use a derived column to group by, depending on the check.
mydf.groupBy(when(col("check").equalTo("do_aggregate"), ...).otherwise(monotonically_increasing_id)).agg(...)
If you have a unique key in the dataframe, use that instead of monotonically_increasing_id.
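A PySpark sketch of that idea with hypothetical columns (key is what the "do_aggregate" rows should be grouped on; every other row gets its own single-row group):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: only rows marked "do_aggregate" should be grouped on "key".
mydf = spark.createDataFrame(
    [("a", "do_aggregate", 1), ("a", "do_aggregate", 2), ("b", "keep", 3)],
    ["key", "check", "value"],
)

# Derived grouping key: the real key for aggregated rows, a unique id otherwise
# (in real data make sure the id cannot collide with an actual key value).
with_key = mydf.withColumn(
    "group_key",
    F.when(F.col("check") == "do_aggregate", F.col("key"))
     .otherwise(F.monotonically_increasing_id().cast("string")),
)

result = with_key.groupBy("group_key").agg(F.sum("value").alias("value"))
result.show()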

Converting Pandas into Pyspark

So I'm trying to convert my Python algorithm to Spark-friendly code and I'm having trouble with this one:
indexer = recordlinkage.SortedNeighbourhoodIndex(left_on=column1, right_on=column2, window=41)
pairs = indexer.index(df_1,df_2)
It basically compares one column against the other and generates index pairs for those likely to be the same (Record Matching).
My code:
df1 = spark.read.load(*.csv)
df2 = spark.read.load(*.csv)
func_udf = udf(index.indexer) ????
df = df.withColumn('column1',func_udf(df1.column1,df2.column2)) ???
I've been using udf for transformations involving just one dataframe and one column, but how do I run a function that requires two arguments, one column from one dataframe and the other from another dataframe? I can't join both dataframes as they have different lengths.
That's not how udfs work. UserDefinedFunctions can operate only on data that comes from a single DataFrame:
a standard udf works on data from a single row;
a pandas_udf works on data from a single partition or a single group.
I can't join both dataframes as they have different lengths.
Join is exactly what you should do (standard or manual broadcast). There is no need for the objects to be the same length; a Spark join is a relational join, not a row-wise merge.
For similarity joins you can use built-in approx join tools:
Efficient string matching in Apache Spark
Optimize Spark job that has to calculate each to each entry similarity and output top N similar items for each
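A minimal sketch of the plain relational join described above, continuing from the question's df1 and df2 (standard join first, then a manual broadcast hint for when one side is small):

from pyspark.sql import functions as F

# Row counts of df1 and df2 do not need to match for a join.
exact_matches = df1.join(df2, df1["column1"] == df2["column2"], "inner")

# If df2 is small, hint a broadcast to avoid shuffling the large side.
broadcast_matches = df1.join(F.broadcast(df2), df1["column1"] == df2["column2"], "inner")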

What is the performance impact of select statements on Spark DataFrames?

When using many select statements or expressions on Spark DataFrames, I wonder what their performance impact is on subsequent transformations once they are triggered by an action.
Given a dataframe df with 10 columns a to j.
What is the impact if I use as(...) to rename each column?
df.select( df("a").as("1"), ..., df("j").as("10"))
What if I select a subset (e.g. 5 columns)?
val df2 = df.select( df("a"), ..., df("e") )
How does Spark handle this projection? Is df still kept (as df2 is a projection of it), so df could serve as a kind of reference, or is df2 created freshly and df discarded? (Neglecting any persist here.)
What is the impact of general Column expressions used in a select?
Are performance tests for the above cases available? Are performance measurements available anywhere in general? If not, how is performance best measured?
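One low-effort way to look at this is to compare the query plans Spark builds: Catalyst usually collapses adjacent projections, so a rename-only select or a narrow projection normally shows up as a single Project node rather than extra work at runtime. A PySpark sketch (a hypothetical 10-column frame standing in for df):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([tuple(range(10))], list("abcdefghij"))

renamed = df.select(*[F.col(c).alias(str(i + 1)) for i, c in enumerate(df.columns)])
subset = df.select("a", "b", "c", "d", "e")

# Compare the analyzed/optimized/physical plans of both variants.
renamed.explain(True)
subset.explain(True)

For wall-clock numbers, the SQL tab of the Spark UI shows per-query metrics once an action actually runs.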