Use a list of filters within polars - python-polars

Is there a way to filter a polars DataFrame by multiple conditions?
This is my use case and how I currently solve it, but I wonder how I would solve it if my list of dates were longer:
from datetime import datetime
import polars as pl

dates = ["2018-03-25", "2019-03-31", "2020-03-29"]
timechange_forward = [
    (datetime.strptime(x + "T02:00", '%Y-%m-%dT%H:%M'),
     datetime.strptime(x + "T03:01", '%Y-%m-%dT%H:%M'))
    for x in dates
]
df.filter(
    pl.col("time").is_between(*timechange_forward[0]) |
    pl.col("time").is_between(*timechange_forward[1]) |
    pl.col("time").is_between(*timechange_forward[2])
)

You could pass multiple conditions to pl.any():
df.filter(
    pl.any(
        pl.col("time").is_between(*time)
        for time in timechange_forward
    )
)
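Note that in newer polars releases, passing several conditions to pl.any like this is deprecated in favour of pl.any_horizontal; a sketch of the same idea, assuming a reasonably recent polars version:
df.filter(
    pl.any_horizontal(
        pl.col("time").is_between(*time)
        for time in timechange_forward
    )
)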

You haven't made your example reproducible, so it's hard to test this, but how about
import functools
import operator

conditions = [
    pl.col("time").is_between(*val)
    for val in timechange_forward
]
df.filter(functools.reduce(operator.or_, conditions))
?
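For reference, a minimal self-contained sketch of that approach; the one-column frame here is made up, since the original data was not shared:
import functools
import operator
from datetime import datetime

import polars as pl

dates = ["2018-03-25", "2019-03-31", "2020-03-29"]
timechange_forward = [
    (datetime.strptime(x + "T02:00", '%Y-%m-%dT%H:%M'),
     datetime.strptime(x + "T03:01", '%Y-%m-%dT%H:%M'))
    for x in dates
]
# made-up sample data: one timestamp inside a window, one outside
df = pl.DataFrame({"time": [datetime(2018, 3, 25, 2, 30), datetime(2020, 1, 1, 12, 0)]})
conditions = [pl.col("time").is_between(*val) for val in timechange_forward]
print(df.filter(functools.reduce(operator.or_, conditions)))  # keeps only the 2018 row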

Related

How to refer to columns containing f-strings in a Pyspark function?

I am writing a function for a Spark DataFrame that performs operations on columns and gives them a suffix, so that I can run the function twice with two different suffixes and join the results later.
I am having a hard time figuring out the best way to refer to the columns in this particular bit of code, and I was wondering what I am missing.
def calc_date(sdf, suffix):
    final_sdf = (
        sdf.withColumn(
            f"lowest_days{suffix}",
            f"sdf.list_of_days_{suffix}"[0],
        )
        .withColumn(
            f"earliest_date_{suffix}",
            f"sdf.list_of_dates_{suffix}"[0],
        )
        .withColumn(
            f"actual_date_{suffix}",
            spark_fns.expr(
                f"date_sub(earliest_date_{suffix}, lowest_days{suffix})"
            ),
        )
    )
Here I am trying to pull the first value from two lists (list_of_days and list_of_dates) and perform a date calculation to create a new variable (actual_date).
I would like to do this in a function so that I don't have to do the same set of operations twice (or more), depending on the number of suffixes I have.
But the f-strings give the error "col should be Column".
Any help on this would be greatly appreciated!
You need to wrap the column name in the second argument with col().
from pyspark.sql.functions import *

def calc_date(sdf, suffix):
    final_sdf = (
        sdf.withColumn(
            f"lowest_days{suffix}",
            col(f"list_of_days_{suffix}")[0],
        )
        .withColumn(
            f"earliest_date_{suffix}",
            col(f"list_of_dates_{suffix}")[0],
        )
    )
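For completeness, the remaining actual_date step from the question should then work as written; a sketch with explicit imports, using the column names from the question:
from pyspark.sql.functions import col, expr

def calc_date(sdf, suffix):
    return (
        sdf.withColumn(f"lowest_days{suffix}", col(f"list_of_days_{suffix}")[0])
        .withColumn(f"earliest_date_{suffix}", col(f"list_of_dates_{suffix}")[0])
        .withColumn(
            f"actual_date_{suffix}",
            # date_sub(start, days) subtracts the day count from the date column
            expr(f"date_sub(earliest_date_{suffix}, lowest_days{suffix})"),
        )
    )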

How to maintain sort order in PySpark collect_list and collect multiple lists

I want to maintain the date sort order, using collect_list for multiple columns, all with the same date order. I'll need them in the same dataframe so I can use them to create a time series model input. (A sample of "train_data" was provided as a screenshot.)
I'm using a Window with partitionBy to ensure sort order by tuning_evnt_start_dt for each Syscode_Stn. I can create one column with this code:
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')
sorted_list_df = (
    train_data
    .withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w))
    .groupBy('Syscode_Stn')
    .agg(F.max('spp_imp_daily').alias('spp_imp_daily'))
)
but how do I create two columns in the same new dataframe?
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')
sorted_list_df = (
    train_data
    .withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w))
    .withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))
    .groupBy('Syscode_Stn')
    .agg(F.max('spp_imp_daily').alias('spp_imp_daily'))
)
Note that MarchMadInd is not shown in the screenshot, but is included in train_data. Explanation of how I got to where I am: https://stackoverflow.com/a/49255498/8691976
Yes, the correct way is to add successive .withColumn statements, followed by an .agg statement that keeps only the complete array for each group. Because the window produces cumulative, ordered lists, F.max picks the longest (i.e. the full, date-ordered) array per Syscode_Stn:
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')
sorted_list_df = (
    train_data
    .withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w))
    .withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))
    .groupBy('Syscode_Stn')
    .agg(
        F.max('spp_imp_daily').alias('spp_imp_daily'),
        F.max('MarchMadInd').alias('MarchMadInd'),
    )
)
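If you prefer to avoid the window entirely, an alternative on Spark 2.4+ is to collect structs and sort them, then project each field back out; a sketch, with the column names assumed from the question:
from pyspark.sql import functions as F

sorted_list_df = (
    train_data
    .groupBy('Syscode_Stn')
    # structs sort by their first field, so the date column drives the order
    .agg(F.sort_array(F.collect_list(F.struct('tuning_evnt_start_dt',
                                              'spp_imp_daily',
                                              'MarchMadInd'))).alias('s'))
    .select(
        'Syscode_Stn',
        F.col('s.spp_imp_daily').alias('spp_imp_daily'),
        F.col('s.MarchMadInd').alias('MarchMadInd'),
    )
)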

How do I select columns from a Spark dataframe when I also need to use withColumnRenamed?

I have a dataframe with three columns:
df = df.select("employee_id", "employee_name", "employee_address")
I need to rename the first two fields, but also still select the third field. So I thought this would work, but it appears to only select employee_address.
df = (df.withColumnRenamed("employee_id", "empId")
        .withColumnRenamed("employee_name", "empName")
        .select("employee_address")
)
How do I properly rename the first two fields while also selecting the third field?
I tried a mix of withColumn usages, but that doesn't work. Do I have to use a select on all three fields?
You can use the alias command:
import pyspark.sql.functions as func
df = df.select(
    func.col("employee_id").alias("empId"),
    func.col("employee_name").alias("empName"),
    func.col("employee_address")
)
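Alternatively, since withColumnRenamed keeps all other columns, you can keep your original chain and simply select all three fields by their final names:
df = (df.withColumnRenamed("employee_id", "empId")
        .withColumnRenamed("employee_name", "empName")
        .select("empId", "empName", "employee_address")
)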

Finding the count, pass and fail percentage of multiple values from a single column while aggregating another column using pyspark

I want to group by Column1 and calculate the percentage of passed and failed values in Column2 for each group, along with their counts. (The example data and the desired output were shown as images.)
Using pyspark I am doing the code below, but I am only getting the percentages:
from pyspark.sql.functions import avg, col

levels = ["passed", "failed", "blocked"]
exprs = [
    avg((col("Column2") == level).cast("double") * 100).alias(level)
    for level in levels
]
df = sparkSession.read.json(hdfsPath)
result1 = df.select('Column1', 'Column2').groupBy("Column1").agg(*exprs)
You would need to explicitly calculate the counts, and then do some string formatting to combine the percentages and the counts into a single column.
from pyspark.sql.functions import avg, col, concat, lit, sum as spark_sum

levels = ["passed", "failed", "blocked"]
# percentage aggregations
pct_exprs = [
    avg((col("Column2") == level).cast("double") * 100).alias('{}_pct'.format(level))
    for level in levels
]
# count aggregations
count_exprs = [
    spark_sum((col("Column2") == level).cast("int")).alias('{}_count'.format(level))
    for level in levels
]
# combine all aggregations
exprs = pct_exprs + count_exprs
# string formatting select expressions
select_exprs = [
    concat(
        col('{}_pct'.format(level)).cast('string'),
        lit('('),
        col('{}_count'.format(level)).cast('string'),
        lit(')')
    ).alias('{}_viz'.format(level))
    for level in levels
]
df = sparkSession.read.json(hdfsPath)
result1 = (
    df
    .select('Column1', 'Column2')
    .groupBy("Column1")
    .agg(*exprs)
    .select('Column1', *select_exprs)
)
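If you also want to control the number formatting, format_string can replace the concat/cast chain; a small optional sketch (the "%.1f (%d)" pattern is just an illustration):
from pyspark.sql.functions import format_string

select_exprs = [
    # renders e.g. "66.7 (2)" for each level
    format_string(
        "%.1f (%d)",
        col('{}_pct'.format(level)),
        col('{}_count'.format(level)),
    ).alias('{}_viz'.format(level))
    for level in levels
]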
NB: it seems like you are trying to use Spark to make a nice visualization of the results of your calculations, but I don't think Spark is well-suited for this task. If you have few enough records that you can see all of them at once, you might as well work locally in Pandas or something similar. And if you have enough records that using Spark makes sense, then you can't see all of them at once anyway so it doesn't matter too much whether they look nice.

Pyspark dataframe LIKE operator

What is the equivalent in Pyspark for LIKE operator?
For example I would like to do:
SELECT * FROM table WHERE column LIKE "*somestring*";
I am looking for something easy like this (but it is not working):
df.select('column').where(col('column').like("*s*")).show()
You can use the where and col functions to do this: where filters rows based on a condition (here, whether the column matches '%string%'), col('col_name') refers to the column, and like is the operator:
df.where(col('col1').like("%string%")).show()
From Spark 2.0.0 onwards, the following also works fine:
df.select('column').where("column like '%s%'").show()
Use the like operator.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions
df.filter(df.column.like('%s%')).show()
To replicate the case-insensitive ILIKE, you can use lower in conjunction with like.
from pyspark.sql.functions import col, lower
df.where(lower(col('col1')).like("%string%")).show()
There is also a regexp-style option (see the rlike sketch below); for a plain substring match, like with % wildcards is enough:
df.select('column').where(col('column').like("%s%")).show()
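If you actually need a regular expression rather than a LIKE pattern, rlike is the corresponding Column method; a short sketch:
# rlike takes a regular expression instead of SQL LIKE wildcards
df.select('column').where(col('column').rlike('somestring')).show()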
This worked for me:
import pyspark.sql.functions as f
df.where(f.col('column').like("%x%")).show()
In pyspark you can always register the dataframe as a table and query it (in newer Spark versions, createOrReplaceTempView plus spark.sql does the same):
df.registerTempTable('my_table')
query = """SELECT * FROM my_table WHERE column LIKE '%somestring%'"""
sqlContext.sql(query).show()
Using spark 2.4, to negate you can simply do:
df = df.filter("column not like '%bla%'")
The contains method can also be used:
df = df.where(col("columnname").contains("somestring"))
I always use a UDF to implement such functionality (though the built-in like/contains column functions are generally faster than a Python UDF):
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

like_f = F.udf(lambda value: 's' in value, BooleanType())
df.filter(like_f('column')).select('column')