PySpark - ValueError: Cannot convert column into bool

So I've seen this solution:
ValueError: Cannot convert column into bool
which I think has the solution. But I'm trying to make it work with my dataframe and can't figure out how to implement it.
My original code:
if df2['DayOfWeek'] >= 6:
    df2['WeekendOrHol'] = 1
this gives me the error:
Cannot convert column into bool: please use '&' for 'and', '|' for
'or', '~' for 'not' when building DataFrame boolean expressions.
So based on the above link I tried:
from pyspark.sql.functions import when
when((df2['DayOfWeek']>=6),df2['WeekendOrHol'] = 1)
when(df2['DayOfWeek']>=6,df2['WeekendOrHol'] = 1)
but this is incorrect as it gives me an error too.

To update a column based on a condition you need to use when like this:
from pyspark.sql import functions as F
# Update `WeekendOrHol`: when `DayOfWeek` >= 6, set it to 1;
# otherwise keep its current value (or do something else).
# If no otherwise() is provided, the non-matching rows are set to None.
df2 = df2.withColumn(
    'WeekendOrHol',
    F.when(
        F.col('DayOfWeek') >= 6, F.lit(1)
    ).otherwise(F.col('WeekendOrHol'))
)
Hope this helps, good luck!

Best answer as provided by pault:
df2 = df2.withColumn("WeekendOrHol", (df2["DayOfWeek"] >= 6).cast("int"))
This is a duplicate of:
this
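For reference, here is a minimal sketch (with a hypothetical toy DataFrame) showing that the when/otherwise and cast approaches produce the same flag:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: (DayOfWeek, WeekendOrHol)
df2 = spark.createDataFrame([(2, 0), (6, 0), (7, 0)], ["DayOfWeek", "WeekendOrHol"])

# when/otherwise version
via_when = df2.withColumn(
    "WeekendOrHol",
    F.when(F.col("DayOfWeek") >= 6, F.lit(1)).otherwise(F.col("WeekendOrHol"))
)

# cast version: the boolean comparison is cast directly to an int
via_cast = df2.withColumn("WeekendOrHol", (df2["DayOfWeek"] >= 6).cast("int"))

via_when.show()
via_cast.show()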


How to change the first letter of the column name to upper case in Polars?

I need to change the first letter of all column names to upper case. Is there a quick way to do it? I've found only rename() and prefix() functions.
Alternative solution, which doesn't involve assigning to a property:
df = df.rename({col: col.capitalize() for col in df.columns})
You can use df.columns (like in pandas):
df.columns = [col.capitalize() for col in df.columns]
I've found another one:
df.select(
    pl.all().map_alias(lambda colName: colName.capitalize())
)
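A small self-contained sketch (toy DataFrame with assumed column names) to try the variants above:

import polars as pl

df = pl.DataFrame({"first_name": ["a"], "age": [1]})

# Variant 1: rename with a dict comprehension
renamed = df.rename({col: col.capitalize() for col in df.columns})
print(renamed.columns)  # ['First_name', 'Age']

# Variant 2: assign to df.columns directly
df.columns = [col.capitalize() for col in df.columns]
print(df.columns)  # ['First_name', 'Age']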

How to filter dataframe columns that start with something and end with something

I have this piece of code currently that works as intended:
val rules_list = df.columns.filter(_.startsWith("rule")).toList
However, this is including some columns that I don't want. How would I add a second filter so that the combined filter is "columns that start with 'rule' and end with any integer value"?
So it should return "rule_1" in the list of columns but not "rule_1_modified"
Thanks and have a great day!
You can simply add a regex expression to your filter:
val rules_list = df.columns.filter(c => c.startsWith("rule") && c.matches("^.*\\d$")).toList
You can use Python's re module like this:
import re

rules_list = []
for col_name in df.columns:
    # keep only names that start with "rule" and end with a digit
    if re.match(r"^rule.*\d$", col_name):
        rules_list.append(col_name)
print(rules_list)
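A quick check of that pattern against some hypothetical column names shows which ones survive the filter:

import re

# Hypothetical column names for illustration only
sample_cols = ["rule_1", "rule_2", "rule_1_modified", "code", "rule_10"]

matched = [c for c in sample_cols if re.match(r"^rule.*\d$", c)]
print(matched)  # ['rule_1', 'rule_2', 'rule_10']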

PySpark list() in withColumn() only works once, then AssertionError: col should be Column

I have a DataFrame with 6 string columns named like 'Spclty1'...'Spclty6' and another 6 named like 'StartDt1'...'StartDt6'. I want to zip them and collapse them into a column that looks like this:
[[Spclty1, StartDt1]...[Spclty6, StartDt6]]
I first tried collapsing just the 'Spclty' columns into a list like this:
DF = DF.withColumn('Spclty', list(DF.select('Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6')))
This worked the first time I executed it, giving me a new column called 'Spclty' containing rows such as ['014', '124', '547', '000', '000', '000'], as expected.
Then, I added a line to my script to do the same thing on a different set of 6 string columns, named 'StartDt1'...'StartDt6':
DF = DF.withColumn('StartDt', list(DF.select('StartDt1', 'StartDt2', 'StartDt3', 'StartDt4', 'StartDt5', 'StartDt6')))
This caused AssertionError: col should be Column.
After I ran out of things to try, I tried the original operation again (as a sanity check):
DF.withColumn('Spclty', list(DF.select('Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6'))).collect()
and got the assertion error as above.
So it would be good to understand why it only worked the first time, but the main question is: what is the correct way to zip columns into a collection of dict-like elements in Spark?
.withColumn() expects a Column object as its second parameter, and you are supplying a list.
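In other words (a tiny illustrative sketch; DF and the column names are assumed to match the question), the second argument has to be built from Column expressions, for example with array:

from pyspark.sql.functions import array, col

# A Column built from other Columns is accepted by withColumn()...
DF = DF.withColumn('Spclty', array(col('Spclty1'), col('Spclty2')))

# ...whereas a plain Python list is not, hence the AssertionError:
# DF.withColumn('Spclty', list(DF.select('Spclty1', 'Spclty2')))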
Thanks. After reading a number of SO posts I figured out the syntax for passing a set of columns to the col parameter, using struct to create an output column that holds a list of values:
from pyspark.sql.functions import array, col, struct

DF_tmp = DF_tmp.withColumn('specialties', array([
    struct(
        *(col("Spclty{}".format(i)).alias("spclty_code"),
          col("StartDt{}".format(i)).alias("start_date"))
    )
    for i in range(1, 7)
]))
So, the col() and *col() constructs are what I was looking for, while the array([struct(...)]) approach lets me combine the 'Spclty' and 'StartDt' entries into a list of dict-like elements.
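A cut-down, self-contained version of the same idea (two hypothetical Spclty/StartDt pairs instead of six, with made-up values):

from pyspark.sql import SparkSession
from pyspark.sql.functions import array, col, struct

spark = SparkSession.builder.getOrCreate()

# Hypothetical data with two column pairs
DF_tmp = spark.createDataFrame(
    [("014", "2010-01-01", "124", "2012-05-01")],
    ["Spclty1", "StartDt1", "Spclty2", "StartDt2"],
)

DF_tmp = DF_tmp.withColumn(
    "specialties",
    array([
        struct(
            col("Spclty{}".format(i)).alias("spclty_code"),
            col("StartDt{}".format(i)).alias("start_date"),
        )
        for i in range(1, 3)
    ]),
)
DF_tmp.select("specialties").show(truncate=False)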

Escape hyphen when reading multiple dataframes at once in pyspark

I can't seem to find any documentation on this in pyspark's docs.
I am trying to read multiple parquets at once, like this:
df = sqlContext.read.option("basePath", "/some/path")\
.load("/some/path/date=[2018-01-01, 2018-01-02]")
And receive the following exception
java.util.regex.PatternSyntaxException: Illegal character range near index 11
E date=[2018-01-01,2018-01-02]
I tried replacing the hyphen with \-, but then I just receive a file not found exception.
I would appreciate help on that
Escaping the - is not your issue. You cannot specify multiple dates in a list like that. If you'd like to read multiple files, you have a few options.
Option 1: Use * as a wildcard:
df = sqlContext.read.option("basePath", "/some/path").load("/some/path/date=2018-01-0*")
But this will also read any files named /some/path/date=2018-01-03 through /some/path/date=2018-01-09.
Option 2: Read each file individually and union the results
from functools import reduce

dates = ["2018-01-01", "2018-01-02"]
df = reduce(
    lambda a, b: a.union(b),
    [
        sqlContext.read.option("basePath", "/some/path").load(
            "/some/path/date={d}".format(d=d)
        )
        for d in dates
    ]
)
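The union pattern itself can be checked without any files; here is a minimal sketch with toy in-memory DataFrames (names and data assumed):

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-ins for the per-date reads
parts = [
    spark.createDataFrame([(1, "2018-01-01")], ["id", "date"]),
    spark.createDataFrame([(2, "2018-01-02")], ["id", "date"]),
]

df = reduce(lambda a, b: a.union(b), parts)
df.show()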

How to extract number from string column?

My requirement is to retrieve the order number from the comment column; it always starts with R. The order number should be added as a new column to the table.
Input data:
code,id,mode,location,status,comment
AS-SD,101,Airways,hyderabad,D,order got delayed R1657
FY-YT,102,Airways,Delhi,ND,R7856 package damaged
TY-OP,103,Airways,Pune,D,Order number R5463 not received
Expected output:
AS-SD,101,Airways,hyderabad,D,order got delayed R1657,R1657
FY-YT,102,Airways,Delhi,ND,R7856 package damaged,R7856
TY-OP,103,Airways,Pune,D,Order number R5463 not received,R5463
I have tried it in Spark SQL; the query I am using is given below:
val r = sqlContext.sql("select substring(comment, PatIndex('%[0-9]%',comment, length(comment))) as number from A")
However, I'm getting the following error:
org.apache.spark.sql.AnalysisException: undefined function PatIndex; line 0 pos 0
You can use regexp_extract, which has the definition:
def regexp_extract(e: Column, exp: String, groupIdx: Int): Column
(R\\d{4}) means R followed by 4 digits. You can easily accommodate any other case by using a valid regex.
df.withColumn("orderId", regexp_extract($"comment", "(R\\d{4})" , 1 )).show
+-----+---+-------+---------+------+--------------------+-------+
| code| id| mode| location|status| comment|orderId|
+-----+---+-------+---------+------+--------------------+-------+
|AS-SD|101|Airways|hyderabad| D|order got delayed...| R1657|
|FY-YT|102|Airways| Delhi| ND|R7856 package dam...| R7856|
|TY-OP|103|Airways| Pune| D|Order number R546...| R5463|
+-----+---+-------+---------+------+--------------------+-------+
You can use a udf function as follows:
import org.apache.spark.sql.functions._
def extractString = udf((comment: String) => comment.split(" ").filter(_.startsWith("R")).head)
df.withColumn("newColumn", extractString($"comment")).show(false)
Here the comment column is split on spaces and the words that start with R are kept; head takes the first such word.
Updated
To ensure that the returned string is an order number starting with R and that the rest of the characters are digits, you can add an additional filter:
import scala.util.Try
def extractString = udf((comment: String) => comment.split(" ").filter(x => x.startsWith("R") && Try(x.substring(1).toDouble).isSuccess).head)
You can edit the filter according to your need.
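If you are on the PySpark side, regexp_extract is available there with the same semantics; a hedged sketch (DataFrame and column names assumed to match the question, using \d+ so order numbers of any length are captured):

from pyspark.sql.functions import regexp_extract

# Group 1 captures "R" followed by one or more digits
df = df.withColumn("orderId", regexp_extract("comment", r"(R\d+)", 1))
df.show(truncate=False)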