TypeError: filter() got an unexpected keyword argument - pyspark

I am trying to filter the rows that have a specific date in a dataframe. They are in the form of month and day, but I keep getting different errors. Not sure what is happening or how to solve it.
This is what my table looks like.
And this is how I am trying to filter the Date_Created rows for Jan 21:
df4 = df3.select("*").filter(Date_Created = 'Jan 21')
I am getting this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-30-a4124a5c0058> in <module>()
----> 1 df4 = df3.select("*").filter(Date_Created = 'Jan 21')
TypeError: filter() got an unexpected keyword argument 'Date_Created'
I also tried changing to double quotes and putting quotes around the column name, but nothing is working... I am kind of guessing right now...

You could use df.filter(df["Date_Created"] == "Jan 21")
Here's an example:
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.master("local").appName("Test").getOrCreate()
    df = spark.createDataFrame(
        [
            (1, "Jan 21", 566),
            (2, "Nov 22", 234),
            (3, "Dec 1", 123),
            (4, "Jan 21", 5466),
            (5, "Jan 21", 4566),
            (3, "Dec 4", 123),
            (3, "Dec 2", 123),
        ],
        ["id", "Date_Created", "Number"],
    )
    df = df.filter(df["Date_Created"] == "Jan 21")
    df.show()
Result:
+---+------------+------+
| id|Date_Created|Number|
+---+------------+------+
| 1| Jan 21| 566|
| 4| Jan 21| 5466|
| 5| Jan 21| 4566|
+---+------------+------+
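For completeness, the same filter can also be written with a column expression or a SQL expression string; the key point is that filter() takes a condition, not a keyword argument (a minimal sketch, assuming the df3 from the question):
import pyspark.sql.functions as F

# equivalent ways to express the same condition
df4 = df3.filter(F.col("Date_Created") == "Jan 21")
df4 = df3.filter("Date_Created = 'Jan 21'")  # SQL expression string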

How to create a column with the maximum number in each row of another column in PySpark?

I have a PySpark dataframe; each row of the column 'TAGID_LIST' is a set of numbers such as {426,427,428,430,432,433,434,437,439,447,448,450,453,460,469,469,469,469}, but I only want to keep the maximum number in each set (469 for this row). I tried to create a new column with:
wechat_userinfo.withColumn('TAG', f.when(wechat_userinfo['TAGID_LIST'] != 'null', max(wechat_userinfo['TAGID_LIST'])).otherwise('null'))
but got TypeError: Column is not iterable.
How do I correct it?
If the column for which you want to retrieve the max value is an array, you can use the array_max function:
import pyspark.sql.functions as F
new_df = wechat_userinfo.withColumn("TAG", F.array_max(F.col("TAGID_LIST")))
To illustrate with an example,
df = spark.createDataFrame( [(1, [1, 772, 3, 4]), (2, [5, 6, 44, 8, 9])], ('a','d'))
df2 = df.withColumn("maxd", F.array_max(F.col("d")))
df2.show()
+---+----------------+----+
| a| d|maxd|
+---+----------------+----+
| 1| [1, 772, 3, 4]| 772|
| 2|[5, 6, 44, 8, 9]| 44|
+---+----------------+----+
In your particular case, the column in question is not an array of numbers but a string, formatted as comma-separated numbers surrounded by { and }. What I'd suggest is turning the string into an array and then operating on that array as described above. You can use the regexp_replace function to strip the braces and then split() the comma-separated string into an array. It would look like this:
df = spark.createDataFrame([(1, "{1,2,3,4}"), (2, "{5,6,7,8}")], ('a', 'd'))
df2 = (
    df
    # strip the leading and trailing braces
    .withColumn("as_str", F.regexp_replace(F.col("d"), r'^\{|\}$', ''))
    # split on commas and cast the pieces to numbers
    .withColumn("as_arr", F.split(F.col("as_str"), ",").cast("array<long>"))
    .withColumn("maxd", F.array_max(F.col("as_arr")))
    .drop("as_str")
)
df2.show()
+---+---------+------------+----+
| a| d| as_arr|maxd|
+---+---------+------------+----+
| 1|{1,2,3,4}|[1, 2, 3, 4]| 4|
| 2|{5,6,7,8}|[5, 6, 7, 8]| 8|
+---+---------+------------+----+
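If you prefer to avoid regular expressions, translate() can strip the braces as well, since matched characters without a replacement are simply deleted (a minimal sketch on the same example df):
# translate() removes every '{' and '}' because the replacement string is empty
df3 = (
    df
    .withColumn("as_arr", F.split(F.translate(F.col("d"), "{}", ""), ",").cast("array<long>"))
    .withColumn("maxd", F.array_max(F.col("as_arr")))
)
df3.show()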

pyspark replace column values with when function gives column object is not callable

I have a table like this
name
----
A
B
ccc
D
eee
and a list of valid names
legal_names = [A, B, D]
And I want to replace all illegal names with another string "INVALID".
I used this script:
(
df.withColumn(
"name",
F.when((F.col("name").isin(legal_names)), F.col("name")).otherwhise(
F.lit("INVALID")
),
)
)
But I get this error
TypeError: 'Column' object is not callable
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File <command-4397929369165676>:4, in <cell line: 2>()
1 (
2 df.withColumn(
3 "name",
----> 4 F.when((F.col("name").isin(legal_names)), F.col("name")).otherwhise(
5 F.lit("INVALID")
6 ),
7 )
8 )
TypeError: 'Column' object is not callable
Dummy data to reproduce:
vals = [("A", ), ("B", ), ("ccc", ), ("D", ), ("EEE", )]
cols = ["name"]
legal_names = ["A", "B", "D"]
df = spark.createDataFrame(vals, cols)
The error comes from the typo otherwhise: accessing a nonexistent attribute on a Column returns another Column (a field accessor), which is then not callable. The method is spelled otherwise. Try the code below:
df1 = df.withColumn(
    "name",
    F.when(F.col("name").isin(*legal_names), F.col("name")).otherwise(F.lit("INVALID")),
)
Output :
+-------+
| name|
+-------+
| A|
| B|
|INVALID|
| D|
|INVALID|
+-------+
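An equivalent way to write it is to flip the condition, which some find easier to read (a sketch using the same df and legal_names):
import pyspark.sql.functions as F

# anything not in the whitelist becomes INVALID, everything else is kept as-is
df1 = df.withColumn(
    "name",
    F.when(~F.col("name").isin(legal_names), F.lit("INVALID")).otherwise(F.col("name")),
)
df1.show()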

Selecting subset spark dataframe by months

I have this dataset:
I want to take a three-month subset of it (e.g. the months April, May and August) using pyspark.
I still haven't found anything that lets me do this with pyspark.
You can extract the month using month() and then apply an isin filter to find the rows matching the criteria.
from pyspark.sql import functions as F
data = [(1, "2021-01-01", ), (2, "2021-04-01", ), (3, "2021-05-01", ), (4, "2021-06-01", ), (5, "2021-07-01", ), (6, "2021-08-01", ), ]
df = spark.createDataFrame(data, ("cod_item", "date_emissao", )).withColumn("date_emissao", F.to_date("date_emissao"))
df.filter(F.month("date_emissao").isin(4, 5, 8)).show()
"""
+--------+------------+
|cod_item|date_emissao|
+--------+------------+
| 2| 2021-04-01|
| 3| 2021-05-01|
| 6| 2021-08-01|
+--------+------------+
"""

Creating a new column using info from another df

I'm trying to create a new column based on information from another data table.
df1
Loc Time Wage
1 192 1
3 192 2
1 193 3
5 193 3
7 193 5
2 194 7
df2
Loc City
1 NYC
2 Miami
3 LA
4 Chicago
5 Houston
6 SF
7 DC
desired output:
Loc Time Wage City
1 192 1 NYC
3 192 2 LA
1 193 3 NYC
5 193 3 Houston
7 193 5 DC
2 194 7 Miami
The actual dataframes are much larger in terms of row count, but it's something along those lines. I think this might be achievable through .map, but I haven't found much documentation for that online. join doesn't really seem to fit this situation.
join is exactly what you need. Try running this in the spark-shell
import spark.implicits._
val col1 = Seq("loc", "time", "wage")
val data1 = Seq((1, 192, 1), (3, 193, 2), (1, 193, 3), (5, 193, 3), (7, 193, 5), (2, 194, 7))
val col2 = Seq("loc", "city")
val data2 = Seq((1, "NYC"), (2, "Miami"), (3, "LA"), (4, "Chicago"), (5, "Houston"), (6, "SF"), (7, "DC"))
val df1 = spark.sparkContext.parallelize(data1).toDF(col1: _*)
val df2 = spark.sparkContext.parallelize(data2).toDF(col2: _*)
val outputDf = df1.join(df2, Seq("loc")) // join on the column "loc"
outputDf.show()
This will output
+---+----+----+-------+
|loc|time|wage| city|
+---+----+----+-------+
| 1| 192| 1| NYC|
| 1| 193| 3| NYC|
| 2| 194| 7| Miami|
| 3| 193| 2| LA|
| 5| 193| 3|Houston|
| 7| 193| 5| DC|
+---+----+----+-------+
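Since the rest of this thread uses PySpark, here is the same join written with the PySpark API (a minimal sketch using the data from the answer above):
df1 = spark.createDataFrame(
    [(1, 192, 1), (3, 193, 2), (1, 193, 3), (5, 193, 3), (7, 193, 5), (2, 194, 7)],
    ["loc", "time", "wage"],
)
df2 = spark.createDataFrame(
    [(1, "NYC"), (2, "Miami"), (3, "LA"), (4, "Chicago"), (5, "Houston"), (6, "SF"), (7, "DC")],
    ["loc", "city"],
)
output_df = df1.join(df2, on="loc")  # inner join on the "loc" column
output_df.show()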

to_date gives null on format yyyyww (202001 and 202053)

I have a dataframe with a yearweek column that I want to convert to a date. The code I wrote seems to work for every week except weeks '202001' and '202053'. Example:
df = spark.createDataFrame([
(1, "202001"),
(2, "202002"),
(3, "202003"),
(4, "202052"),
(5, "202053")
], ['id', 'week_year'])
df.withColumn("date", F.to_date(F.col("week_year"), "yyyyw")).show()
I can't figure out what the error is or how to fix these weeks. How can I convert weeks 202001 and 202053 to a valid date?
Dealing with ISO week in Spark is indeed a headache - in fact this functionality was deprecated (removed?) in Spark 3. I think using Python datetime utilities within a UDF is a more flexible way to do this.
import datetime
import pyspark.sql.functions as F
@F.udf('date')
def week_year_to_date(week_year):
    # the '1' specifies the first day of the week
    return datetime.datetime.strptime(week_year + '1', '%G%V%u')
df = spark.createDataFrame([
(1, "202001"),
(2, "202002"),
(3, "202003"),
(4, "202052"),
(5, "202053")
], ['id', 'week_year'])
df.withColumn("date", week_year_to_date('week_year')).show()
+---+---------+----------+
| id|week_year| date|
+---+---------+----------+
| 1| 202001|2019-12-30|
| 2| 202002|2020-01-06|
| 3| 202003|2020-01-13|
| 4| 202052|2020-12-21|
| 5| 202053|2020-12-28|
+---+---------+----------+
Based on mck's answer, this is the solution I ended up using for Python version 3.5.2:
import datetime
from dateutil.relativedelta import relativedelta
import pyspark.sql.functions as F
@F.udf('date')
def week_year_to_date(week_year):
    # the '1' specifies the first day of the week
    return datetime.datetime.strptime(week_year + '1', '%Y%W%w') - relativedelta(weeks=1)
df = spark.createDataFrame([
(9, "201952"),
(1, "202001"),
(2, "202002"),
(3, "202003"),
(4, "202052"),
(5, "202053")
], ['id', 'week_year'])
df.withColumn("date", week_year_to_date('week_year')).show()
Since '%G%V%u' was only added in Python 3.6, I had to subtract a week from the date to get the correct dates.
The following does not use a plain udf but a more efficient, vectorized pandas_udf instead:
import pandas as pd
import pyspark.sql.functions as F

@F.pandas_udf('date')
def week_year_to_date(week_year: pd.Series) -> pd.Series:
    return pd.to_datetime(week_year + '1', format='%G%V%u')
df.withColumn('date', week_year_to_date('week_year')).show()
# +---+---------+----------+
# | id|week_year| date|
# +---+---------+----------+
# | 1| 202001|2019-12-30|
# | 2| 202002|2020-01-06|
# | 3| 202003|2020-01-13|
# | 4| 202052|2020-12-21|
# | 5| 202053|2020-12-28|
# +---+---------+----------+
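As a quick sanity check on either UDF, every returned date should be a Monday, since ISO weeks start on Monday (a small hypothetical check using dayofweek, where Spark numbers Sunday as 1 and Monday as 2):
(df.withColumn("date", week_year_to_date("week_year"))
   .withColumn("dow", F.dayofweek("date"))  # expect 2 (Monday) in every row
   .show())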