Dataframe filtering with condition applied to list of columns - pyspark

I want to filter a PySpark dataframe, dropping rows where any of the string columns in a list is empty. My attempt
df = df.where(all([col(x) != '' for x in col_list]))
fails with
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

You can use reduce from functools to simulate all, like this:
from functools import reduce
from pyspark.sql import functions as F

spark_df.where(reduce(lambda x, y: x & y, (F.col(x) != '' for x in col_list))).show()

Since filter (or where) is a lazily evaluated transformation, we can merge multiple conditions by applying them one by one, e.g.
for c in col_list:
    spark_df = spark_df.filter(col(c) != "")
spark_df.show()
This may be a bit more readable, but in the end it is executed in exactly the same way as Sreeram's answer.
On a side note, removing rows with missing values would most often be done with
df.na.drop(how="any", subset=col_list)
but it only handles missing (null / None) values, not empty strings.
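
For a quick end-to-end check, here is a minimal self-contained sketch of the reduce-based filter on a toy dataframe (the column names and values below are made up for illustration):

from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data, for illustration only
df = spark.createDataFrame(
    [("a", "x", "1"), ("b", "", "2"), ("", "y", "3")],
    ["c1", "c2", "c3"])
col_list = ["c1", "c2"]

# Keep only rows where every column in col_list is non-empty
condition = reduce(lambda acc, c: acc & c, (F.col(c) != "" for c in col_list))
df.where(condition).show()
# Only the ("a", "x", "1") row survives; the other rows have an empty c1 or c2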

Using 'where' instead of 'expr' when filtering for values in multiple columns in scala spark

I'm having some trouble refactoring a Spark dataframe transformation so that it does not use expr but instead uses dataframe filters and when conditionals.
My code is this:
outDF = outDF.withColumn("MAIN_TYPE", expr
("case when 'TYPE_A' in (GROUP_A,GROUP_B,GROUP_C,GROUP_D) then 'TYPE_A'" +
"when 'TYPE_B' in (GROUP_A,GROUP_B,GROUP_C,GROUP_D) then 'TYPE_B'" +
"when 'TYPE_C' in (GROUP_A,GROUP_B,GROUP_C,GROUP_D) then 'TYPE_C'" +
"when 'TYPE_D' in (GROUP_A,GROUP_B,GROUP_C,GROUP_D) then 'TYPE_D' else '0' end")
.cast(StringType))
The only solution I could think of so far is a series of individual .when().otherwise() chains, but that would require m×n lines, where m is the number of Types and n the number of Groups that I need.
Is there any better way to do this kind of operation?
Thank you very much for your time!
So, this is how I worked it out, in case anyone is interested.
I used a helper column for the groups, which I later dropped:
outDF = outDF.withColumn("Helper_Column", concat(col("Group_A"),col("Group_B"),
col("Group_C"),col("Group_D")))
outDF = outDF.withColumn("MAIN_TYPE", when(col("Helper_Column").like("%Type_A%"),"Type_A").otherwise(
when(col("Helper_Column").like("%Type_B%"),"Type_B").otherwise(
when(col("Helper_Column").like("%Type_C%"),"Type_C").otherwise(
when(col("Helper_Column").like("%Type_D%"),"Type_D").otherwise(lit("0")
)))))
outDF = outDF.drop("Helper_Column")
Hope this helps someone.
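
For anyone doing the same thing in PySpark, here is a rough sketch of the same helper-column idea that builds the when chain in a loop, so the code stays proportional to the number of types rather than m×n. The column names are taken from the question and outDF is assumed to be the dataframe being transformed; treat this as an untested sketch:

from functools import reduce
from pyspark.sql import functions as F

types = ["Type_A", "Type_B", "Type_C", "Type_D"]
groups = ["Group_A", "Group_B", "Group_C", "Group_D"]

# Concatenate the group columns into one helper expression, as in the answer above
helper = F.concat(*[F.col(g) for g in groups])

# Fold the types into a single chained when(...) expression; the first match wins
main_type = reduce(
    lambda expr, t: expr.when(helper.like(f"%{t}%"), t),
    types[1:],
    F.when(helper.like(f"%{types[0]}%"), types[0]),
).otherwise(F.lit("0"))

outDF = outDF.withColumn("MAIN_TYPE", main_type)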

How to cast a column with data type List[null] to List[i64] in polars

Hey, I have the following problem: I'd like to use the Polars apply function on columns with the datatype List.
In most cases this works, but in some cases all lists in the column are empty and the column's datatype is List[null]; in that special case the code crashes.
Here is some example code:
import pandas as pd
import polars as pl

df = pl.from_pandas(pd.DataFrame(data=[
    [[]],
    [[]]
], columns=['A']))
df.with_columns(pl.col('A').apply(lambda x: x))
results in
pyo3_runtime.PanicException: Unwrapped panic from Python code
I think the problem could easily be solved by casting the column to another List datatype, but I have no idea how to do that.
In polars>=0.13.11 you can:
df = pl.from_pandas(pd.DataFrame(data=[
    [[]],
    [[]]
], columns=['A']))

assert df["A"].cast(pl.List(pl.Int64)).dtype.inner == pl.Int64
assert df["A"].cast(pl.List(int)).dtype.inner == pl.Int64

Create multiple rows of fixed length from a data frame column in Pyspark

My input is a PySpark dataframe with only one column, DETAIL_REC.
detail_df.show()
DETAIL_REC
================================
ABC12345678ABC98765543ABC98762345
detail_df.printSchema()
root
 |-- DETAIL_REC: string (nullable = true)
The string has to be split every 11 characters, with each chunk on its own row of the dataframe, for a downstream process to consume.
The expected output should be multiple rows in the dataframe:
DETAIL_REC (no blank lines after each record)
==============
ABC12345678
ABC98765543
ABC98762345
If you have Spark 2.4+, we can make use of higher-order functions to do it like below:
from pyspark.sql import functions as F

n = 11
output = df.withColumn("SubstrCol", F.explode(F.expr(f"""filter(
    transform(
        sequence(0, length(DETAIL_REC), {n}),
        x -> substring(DETAIL_REC, x + 1, {n})),
    y -> y <> '')""")))
output.show(truncate=False)
+---------------------------------+-----------+
|DETAIL_REC |SubstrCol |
+---------------------------------+-----------+
|ABC12345678ABC98765543ABC98762345|ABC12345678|
|ABC12345678ABC98765543ABC98762345|ABC98765543|
|ABC12345678ABC98765543ABC98762345|ABC98762345|
+---------------------------------+-----------+
Logic used:
First generate a sequence of integers starting from 0 up to the length of the string, in steps of 11 (n).
Using transform, iterate through this sequence and keep taking substrings of length n from the original string (this keeps shifting the start position).
Filter out any blank strings from the resulting array and explode the array.
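
On Spark 3.1+, the same logic can also be written with the DataFrame API's native higher-order functions instead of an expr string; a sketch along those lines, assuming the same df and n as above:

from pyspark.sql import functions as F

n = 11

# Start offsets 0, 11, 22, ... then substrings of length n at each offset,
# then drop the empty leftovers and explode, mirroring the expr version above
starts = F.sequence(F.lit(0), F.length("DETAIL_REC"), F.lit(n))
chunks = F.transform(starts, lambda x: F.col("DETAIL_REC").substr(x + 1, F.lit(n)))
output = df.withColumn("SubstrCol", F.explode(F.filter(chunks, lambda y: y != "")))
output.show(truncate=False)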
For lower versions of Spark, use a udf with textwrap or any other function, as shown below:
from pyspark.sql import functions as F, types as T
from textwrap import wrap

n = 11
myudf = F.udf(lambda x: wrap(x, n), T.ArrayType(T.StringType()))
output = df.withColumn("SubstrCol", F.explode(myudf("DETAIL_REC")))
output.show(truncate=False)
+---------------------------------+-----------+
|DETAIL_REC |SubstrCol |
+---------------------------------+-----------+
|ABC12345678ABC98765543ABC98762345|ABC12345678|
|ABC12345678ABC98765543ABC98762345|ABC98765543|
|ABC12345678ABC98765543ABC98762345|ABC98762345|
+---------------------------------+-----------+

How to explode a struct column with a prefix?

My goal is to explode a Spark struct column, i.e., take the fields from inside the struct and expose them as top-level columns of the dataset (which I have already done), but changing the inner field names by prepending an arbitrary string. One of the motivations is that my struct can contain columns that have the same name as columns outside of it - therefore, I need a way to differentiate them easily. Of course, I do not know beforehand what the columns inside my struct are.
Here is what I have so far:
implicit class Implicit(df: DataFrame) {
  def explodeStruct(column: String) = df.select("*", column + ".*").drop(column)
}
This does the job alright - I use it like this:
df.explodeStruct("myColumn")
It returns all the columns from the original dataframe, plus the inner columns of the struct at the end.
As for prepending the prefix, my idea is to take the column and find out what its inner columns are. I browsed the documentation and could not find any method on the Column class that does that. I then changed my approach: take the schema of the DataFrame, filter the result by the name of the column, and extract the matching element from the resulting array. The problem is that this element has the type StructField - which, again, offers no way to extract its inner fields - whereas what I would really like is to get a StructType element, which has the .getFields method that does exactly what I want (that is, giving me the names of the inner columns, so I can iterate over them and use them in my select, prepending the prefix I want). I know of no way to convert a StructField to a StructType.
My last attempt would be to parse the output of StructField.toString - which contains all the names and types of the inner columns, although that feels really dirty, and I'd rather avoid that lowly approach.
Any elegant solution to this problem?
Well, after reading my own question again, I figured out an elegant solution to the problem: I just needed to select all the columns the way I was already doing, and then compare the result back to the original dataframe to figure out which columns were new. Here is the final result - I also made sure that the exploded columns show up in the same place as the original struct column, so as not to break the flow of information:
implicit class Implicit(df: DataFrame) {
  def explodeStruct(column: String) = {
    val prefix = column + "_"
    val originalPosition = df.columns.indexOf(column)
    val dfWithAllColumns = df.select("*", column + ".*")
    val explodedColumns = dfWithAllColumns.columns diff df.columns
    val prefixedExplodedColumns = explodedColumns.map(c => col(column + "." + c) as prefix + c)
    val finalColumnsList = df.columns.map(col).patch(originalPosition, prefixedExplodedColumns, 1)
    df.select(finalColumnsList: _*)
  }
}
Of course, you can customize the prefix, the separator, and so on - but that is simple; anyone can tweak the parameters. The usage remains the same.
In case anyone is interested, here is something similar for PySpark:
from pyspark.sql import DataFrame, functions as F

def explode_struct(df: DataFrame, column: str) -> DataFrame:
    original_position = df.columns.index(column)
    new_columns = df.select(column + ".*").columns
    exploded_columns = [F.col(column + "." + c).alias(column + "_" + c) for c in new_columns]
    col_list = [F.col(c) for c in df.columns]
    col_list.pop(original_position)
    col_list[original_position:original_position] = exploded_columns
    return df.select(col_list)
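
Usage mirrors the Scala version, e.g. for the hypothetical struct column from the question:

df = explode_struct(df, "myColumn")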

PySpark RDD Filter with "not in" for multiple values

I have an RDD that looks like this:
myRDD:
[[u'16/12/2006', u'17:24:00'],
 [u'16/12/2006', u'?'],
 [u'16/12/2006', u'']]
I want to exclude the records that contain '?' or ''.
The following code works for filtering one value at a time, but is there a way to combine the conditions and filter out both '?' and '' in one go, so that I get back:
[u'16/12/2006', u'17:24:00']
The code below works only for one item at a time; how can it be extended to multiple items?
myRDD.filter(lambda x: '?' not in x)
I want help on how to write something like:
myRDD.filter(lambda x: '?' not in x && '' not in x)
Try this:
myRDD.filter(lambda x: ('?' not in x) & ('' not in x))
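
If the set of unwanted values grows, one way to generalise it (a sketch, using a hypothetical bad_values set) is:

bad_values = {'?', ''}

# Keep only records that share no element with bad_values
myRDD.filter(lambda x: not bad_values.intersection(x)).collect()
# [[u'16/12/2006', u'17:24:00']]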