TypeError: filter() got an unexpected keyword argument - pyspark

I am trying to filter the rows that have a specific date in a dataframe. They are in the form of month and day, but I keep getting different errors. Not sure what is happening or how to solve it.
This is what my table looks like.
And this is how I am trying to filter the Date_Created rows for Jan 21:
df4 = df3.select("*").filter(Date_Created = 'Jan 21')
I am getting this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-30-a4124a5c0058> in <module>()
----> 1 df4 = df3.select("*").filter(Date_Created = 'Jan 21')
TypeError: filter() got an unexpected keyword argument 'Date_Created'
I also tried changing to double quotes and putting quotes around the column name, but nothing is working... I am kind of guessing right now...

You could use df.filter(df["Date_Created"] == "Jan 21")
Here's an example:
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.master("local").appName("Test").getOrCreate()
    df = spark.createDataFrame(
        [
            (1, "Jan 21", 566),
            (2, "Nov 22", 234),
            (3, "Dec 1", 123),
            (4, "Jan 21", 5466),
            (5, "Jan 21", 4566),
            (3, "Dec 4", 123),
            (3, "Dec 2", 123),
        ],
        ["id", "Date_Created", "Number"],
    )
    df = df.filter(df["Date_Created"] == "Jan 21")
    df.show()
Result:
+---+------------+------+
| id|Date_Created|Number|
+---+------------+------+
| 1| Jan 21| 566|
| 4| Jan 21| 5466|
| 5| Jan 21| 4566|
+---+------------+------+
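For completeness, the same filter can also be written with a column expression or a SQL expression string; the key point is that filter() takes a condition, not a keyword argument (a minimal sketch, assuming the df3 from the question):
import pyspark.sql.functions as F

# equivalent ways to express the same condition
df4 = df3.filter(F.col("Date_Created") == "Jan 21")
df4 = df3.filter("Date_Created = 'Jan 21'")  # SQL expression string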

How to create a column with the maximum number in each row of another column in PySpark?

I have a PySpark dataframe; each row of the column 'TAGID_LIST' is a set of numbers such as {426,427,428,430,432,433,434,437,439,447,448,450,453,460,469,469,469,469}, but I only want to keep the maximum number in each set (469 for this row). I tried to create a new column with:
wechat_userinfo.withColumn('TAG', f.when(wechat_userinfo['TAGID_LIST'] != 'null', max(wechat_userinfo['TAGID_LIST'])).otherwise('null'))
but got TypeError: Column is not iterable.
How do I correct it?
If the column for which you want to retrieve the max value is an array, you can use the array_max function:
import pyspark.sql.functions as F
new_df = wechat_userinfo.withColumn("TAG", F.array_max(F.col("TAGID_LIST")))
To illustrate with an example,
df = spark.createDataFrame( [(1, [1, 772, 3, 4]), (2, [5, 6, 44, 8, 9])], ('a','d'))
df2 = df.withColumn("maxd", F.array_max(F.col("d")))
df2.show()
+---+----------------+----+
| a| d|maxd|
+---+----------------+----+
| 1| [1, 772, 3, 4]| 772|
| 2|[5, 6, 44, 8, 9]| 44|
+---+----------------+----+
In your particular case, the column in question is not an array of numbers but a string, formatted as comma-separated numbers surrounded by { and }. What I'd suggest is turning the string into an array and then operating on that array as described above. You can use the regexp_replace function to strip the braces and then split() the comma-separated string into an array. It would look like this:
df = spark.createDataFrame([(1, "{1,2,3,4}"), (2, "{5,6,7,8}")], ('a', 'd'))
df2 = (
    df
    # strip the leading and trailing braces
    .withColumn("as_str", F.regexp_replace(F.col("d"), r'^\{|\}$', ''))
    # split on commas and cast the pieces to numbers
    .withColumn("as_arr", F.split(F.col("as_str"), ",").cast("array<long>"))
    .withColumn("maxd", F.array_max(F.col("as_arr")))
    .drop("as_str")
)
df2.show()
+---+---------+------------+----+
| a| d| as_arr|maxd|
+---+---------+------------+----+
| 1|{1,2,3,4}|[1, 2, 3, 4]| 4|
| 2|{5,6,7,8}|[5, 6, 7, 8]| 8|
+---+---------+------------+----+
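If you prefer to avoid regular expressions, translate() can strip the braces as well, since matched characters without a replacement are simply deleted (a minimal sketch on the same example df):
# translate() removes every '{' and '}' because the replacement string is empty
df3 = (
    df
    .withColumn("as_arr", F.split(F.translate(F.col("d"), "{}", ""), ",").cast("array<long>"))
    .withColumn("maxd", F.array_max(F.col("as_arr")))
)
df3.show()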

pyspark replace column values with when function gives column object is not callable

I have a table like this
name
----
A
B
ccc
D
eee
and a list of valid names
legal_names = [A, B, D]
And I want to replace all illegal names with another string "INVALID".
I used this script:
(
df.withColumn(
"name",
F.when((F.col("name").isin(legal_names)), F.col("name")).otherwhise(
F.lit("INVALID")
),
)
)
But I get this error
TypeError: 'Column' object is not callable
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File <command-4397929369165676>:4, in <cell line: 2>()
1 (
2 df.withColumn(
3 "name",
----> 4 F.when((F.col("name").isin(legal_names)), F.col("name")).otherwhise(
5 F.lit("INVALID")
6 ),
7 )
8 )
TypeError: 'Column' object is not callable
Dummy data to reproduce:
vals = [("A", ), ("B", ), ("ccc", ), ("D", ), ("EEE", )]
cols = ["name"]
legal_names = ["A", "B", "D"]
df = spark.createDataFrame(vals, cols)
The error comes from the typo otherwhise: accessing a nonexistent attribute on a Column returns another Column (a field accessor), which is then not callable. The method is spelled otherwise. Try the code below:
df1 = df.withColumn(
    "name",
    F.when(F.col("name").isin(*legal_names), F.col("name")).otherwise(F.lit("INVALID")),
)
Output :
+-------+
| name|
+-------+
| A|
| B|
|INVALID|
| D|
|INVALID|
+-------+
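An equivalent way to write it is to flip the condition, which some find easier to read (a sketch using the same df and legal_names):
import pyspark.sql.functions as F

# anything not in the whitelist becomes INVALID, everything else is kept as-is
df1 = df.withColumn(
    "name",
    F.when(~F.col("name").isin(legal_names), F.lit("INVALID")).otherwise(F.col("name")),
)
df1.show()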

Selecting subset spark dataframe by months

I have this dataset:
I want to take a three-month subset of it (e.g. the months April, May and August) using pyspark.
I still haven't found anything that lets me do this with pyspark.
You can extract the month using month() and then apply an isin filter to find the rows matching the criteria.
from pyspark.sql import functions as F
data = [(1, "2021-01-01", ), (2, "2021-04-01", ), (3, "2021-05-01", ), (4, "2021-06-01", ), (5, "2021-07-01", ), (6, "2021-08-01", ), ]
df = spark.createDataFrame(data, ("cod_item", "date_emissao", )).withColumn("date_emissao", F.to_date("date_emissao"))
df.filter(F.month("date_emissao").isin(4, 5, 8)).show()
"""
+--------+------------+
|cod_item|date_emissao|
+--------+------------+
| 2| 2021-04-01|
| 3| 2021-05-01|
| 6| 2021-08-01|
+--------+------------+
"""

Creating a new column using info from another df

I'm trying to create a new column based on information from another data table.
df1
Loc Time Wage
1 192 1
3 192 2
1 193 3
5 193 3
7 193 5
2 194 7
df2
Loc City
1 NYC
2 Miami
3 LA
4 Chicago
5 Houston
6 SF
7 DC
desired output:
Loc Time Wage City
1 192 1 NYC
3 192 2 LA
1 193 3 NYC
5 193 3 Houston
7 193 5 DC
2 194 7 Miami
The actual dataframes are much larger in terms of row count, but it's something along those lines. I think this might be achievable through .map, but I haven't found much documentation for that online. join doesn't really seem to fit this situation.
join is exactly what you need. Try running this in the spark-shell
import spark.implicits._
val col1 = Seq("loc", "time", "wage")
val data1 = Seq((1, 192, 1), (3, 193, 2), (1, 193, 3), (5, 193, 3), (7, 193, 5), (2, 194, 7))
val col2 = Seq("loc", "city")
val data2 = Seq((1, "NYC"), (2, "Miami"), (3, "LA"), (4, "Chicago"), (5, "Houston"), (6, "SF"), (7, "DC"))
val df1 = spark.sparkContext.parallelize(data1).toDF(col1: _*)
val df2 = spark.sparkContext.parallelize(data2).toDF(col2: _*)
val outputDf = df1.join(df2, Seq("loc")) // join on the column "loc"
outputDf.show()
This will output
+---+----+----+-------+
|loc|time|wage| city|
+---+----+----+-------+
| 1| 192| 1| NYC|
| 1| 193| 3| NYC|
| 2| 194| 7| Miami|
| 3| 193| 2| LA|
| 5| 193| 3|Houston|
| 7| 193| 5| DC|
+---+----+----+-------+
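Since the rest of this thread uses PySpark, here is the same join written with the PySpark API (a minimal sketch using the data from the answer above):
df1 = spark.createDataFrame(
    [(1, 192, 1), (3, 193, 2), (1, 193, 3), (5, 193, 3), (7, 193, 5), (2, 194, 7)],
    ["loc", "time", "wage"],
)
df2 = spark.createDataFrame(
    [(1, "NYC"), (2, "Miami"), (3, "LA"), (4, "Chicago"), (5, "Houston"), (6, "SF"), (7, "DC")],
    ["loc", "city"],
)
output_df = df1.join(df2, on="loc")  # inner join on the "loc" column
output_df.show()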

to_date gives null on format yyyyww (202001 and 202053)

I have a dataframe with a yearweek column that I want to convert to a date. The code I wrote seems to work for every week except weeks '202001' and '202053'. Example:
df = spark.createDataFrame([
(1, "202001"),
(2, "202002"),
(3, "202003"),
(4, "202052"),
(5, "202053")
], ['id', 'week_year'])
df.withColumn("date", F.to_date(F.col("week_year"), "yyyyw")).show()
I can't figure out what the error is or how to fix these weeks. How can I convert weeks 202001 and 202053 to a valid date?
Dealing with ISO week in Spark is indeed a headache - in fact this functionality was deprecated (removed?) in Spark 3. I think using Python datetime utilities within a UDF is a more flexible way to do this.
import datetime
import pyspark.sql.functions as F
@F.udf('date')
def week_year_to_date(week_year):
    # the '1' specifies the first day of the week
    return datetime.datetime.strptime(week_year + '1', '%G%V%u')
df = spark.createDataFrame([
(1, "202001"),
(2, "202002"),
(3, "202003"),
(4, "202052"),
(5, "202053")
], ['id', 'week_year'])
df.withColumn("date", week_year_to_date('week_year')).show()
+---+---------+----------+
| id|week_year| date|
+---+---------+----------+
| 1| 202001|2019-12-30|
| 2| 202002|2020-01-06|
| 3| 202003|2020-01-13|
| 4| 202052|2020-12-21|
| 5| 202053|2020-12-28|
+---+---------+----------+
Based on mck's answer, this is the solution I ended up using for Python version 3.5.2:
import datetime
from dateutil.relativedelta import relativedelta
import pyspark.sql.functions as F
@F.udf('date')
def week_year_to_date(week_year):
    # the '1' specifies the first day of the week
    return datetime.datetime.strptime(week_year + '1', '%Y%W%w') - relativedelta(weeks=1)
df = spark.createDataFrame([
(9, "201952"),
(1, "202001"),
(2, "202002"),
(3, "202003"),
(4, "202052"),
(5, "202053")
], ['id', 'week_year'])
df.withColumn("date", week_year_to_date('week_year')).show()
Since '%G%V%u' was only added in Python 3.6, I had to subtract a week from the date to get the correct dates.
The following does not use a plain udf but a more efficient, vectorized pandas_udf instead:
import pandas as pd
import pyspark.sql.functions as F

@F.pandas_udf('date')
def week_year_to_date(week_year: pd.Series) -> pd.Series:
    return pd.to_datetime(week_year + '1', format='%G%V%u')
df.withColumn('date', week_year_to_date('week_year')).show()
# +---+---------+----------+
# | id|week_year| date|
# +---+---------+----------+
# | 1| 202001|2019-12-30|
# | 2| 202002|2020-01-06|
# | 3| 202003|2020-01-13|
# | 4| 202052|2020-12-21|
# | 5| 202053|2020-12-28|
# +---+---------+----------+
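As a quick sanity check on either UDF, every returned date should be a Monday, since ISO weeks start on Monday (a small hypothetical check using dayofweek, where Spark numbers Sunday as 1 and Monday as 2):
(df.withColumn("date", week_year_to_date("week_year"))
   .withColumn("dow", F.dayofweek("date"))  # expect 2 (Monday) in every row
   .show())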