How can I achieve functionality similar to pandas.reindex(new_index, method="ffill") with a datetime column in polars? - python-polars

In Pandas I can add new rows by their index and forward fill in values without filling any other nulls in the DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame(data={"a": [1.0, 2.0, np.nan, 3.0]}, index=pd.date_range("2020", periods=4, freq="T"))
print(df)
df = df.reindex(index=df.index.union(pd.date_range("2020-01-01 00:01:30", periods=2, freq="T")), method="ffill")
print(df)
Giving output
a
2020-01-01 00:00:00 1.0
2020-01-01 00:01:00 2.0
2020-01-01 00:02:00 NaN
2020-01-01 00:03:00 3.0
a
2020-01-01 00:00:00 1.0
2020-01-01 00:01:00 2.0
2020-01-01 00:01:30 2.0
2020-01-01 00:02:00 NaN
2020-01-01 00:02:30 NaN
2020-01-01 00:03:00 3.0
Is it possible to achieve something similar using Polars? I am using Polars mainly because it has better performance for my data so far, so performance matters.
I can think of a concat -> sort -> ffill approach, something like:
let new_index_values = new_index_values.into_series().into_frame();
let new_index_values_len = new_index_values.height();

// Build a frame with the new index values and all-null columns for the rest
let mut cols = vec![new_index_values];
let col_names = source.get_column_names();
for col_name in col_names.clone() {
    if col_name != index_column {
        cols.push(
            Series::full_null(
                col_name,
                new_index_values_len,
                source.column(col_name)?.dtype(),
            )
            .into_frame(),
        )
    }
}
let range_frame = hor_concat_df(&cols)?.select(col_names)?;

// Append the new rows, sort on the index, forward fill, then drop duplicate index values
concat([source.clone().lazy(), range_frame.lazy()], true, true)?
    .sort(
        index_column,
        SortOptions {
            descending: false,
            nulls_last: true,
        },
    )
    .collect()?
    .fill_null(FillNullStrategy::Forward(Some(1)))?
    .unique(Some(&[index_column.into()]), UniqueKeepStrategy::Last)
but this will fill nulls other than the ones that were added. I need to preserve the nulls in the original data, so that does not work for me.

I'm not familiar with Rust so this would be the python way to do it (or at least how I would approach it).
Starting with:
import polars as pl
from datetime import datetime

pldf = pl.DataFrame({
    "dt": pl.date_range(datetime(2020, 1, 1), datetime(2020, 1, 1, 0, 3), "1m"),
    "a": [1.0, 2.0, None, 3.0]
})
and then you want to add
new_rows = pl.DataFrame({
    "dt": pl.date_range(datetime(2020, 1, 1, 0, 1, 30), datetime(2020, 1, 1, 0, 2, 30), "1m")
})
All I've done is convert the pandas date_range syntax to the polars one.
To put those together, use a join_asof. Since these frames were constructed with date_range, they're already in order, but if real data is constructed a different way, ensure you sort them first.
new_rows = new_rows.join_asof(pldf, on='dt')
This just gives you the actual new_rows and then you can concat them together to get to your final answer.
pldf = pl.concat([pldf, new_rows]).sort('dt')
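For reference, here is the whole thing as one self-contained sketch (assuming a recent polars release where pl.datetime_range with eager=True returns a Series; on the older versions this answer was written against, pl.date_range as above behaves the same way):
import polars as pl
from datetime import datetime

pldf = pl.DataFrame({
    "dt": pl.datetime_range(datetime(2020, 1, 1), datetime(2020, 1, 1, 0, 3), "1m", eager=True),
    "a": [1.0, 2.0, None, 3.0],
})
new_rows = pl.DataFrame({
    "dt": pl.datetime_range(datetime(2020, 1, 1, 0, 1, 30), datetime(2020, 1, 1, 0, 2, 30), "1m", eager=True),
})
# join_asof gives each new timestamp the last value at or before it,
# without touching the nulls already present in the original frame
new_rows = new_rows.join_asof(pldf, on="dt")
pldf = pl.concat([pldf, new_rows]).sort("dt")
print(pldf)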

Related

Pyspark calculate average of non-zero elements for each column

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0.0, 1.2, -1.3), (0.0, 0.0, 0.0),
     (-17.2, 20.3, 15.2), (23.4, 1.4, 0.0)],
    ['col1', 'col2', 'col3'])
df1 = df.agg(F.avg('col1'))
df2 = df.agg(F.avg('col2'))
df3 = df.agg(F.avg('col3'))
If I have a dataframe,
ID COL1 COL2 COL3
1 0.0 1.2 -1.3
2 0.0 0.0 0.0
3 -17.2 20.3 15.2
4 23.4 1.4 0.0
I want to calculate mean for each column.
avg1 avg2 avg3
1 3.1 7.6 6.9
The result of the above code is 1.54, 5.725, 3.47, which includes the zero elements in the averages.
How can I do it?
Null values do not affect the average, so if you turn the zero values into nulls you get the average of the non-zero values:
(
    df
    .agg(
        F.avg(F.when(F.col('col1') == 0, None).otherwise(F.col('col1'))).alias('avg(col1)'),
        F.avg(F.when(F.col('col2') == 0, None).otherwise(F.col('col2'))).alias('avg(col2)'),
        F.avg(F.when(F.col('col3') == 0, None).otherwise(F.col('col3'))).alias('avg(col3)'))
).show()
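If there are many columns, the same pattern can be generated instead of spelled out per column (a small sketch reusing the df from above):
from pyspark.sql import functions as F

cols = ['col1', 'col2', 'col3']
# Build one avg-of-non-zero expression per column
df.agg(*[
    F.avg(F.when(F.col(c) == 0, None).otherwise(F.col(c))).alias('avg({})'.format(c))
    for c in cols
]).show()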

Spark scala aggregate to an array and concat it

I have a Dataset with a number of columns that looks like this (columns: name, timestamp, platform, clickcount, id):
Joy 2021-10-10T10:27:16 apple 5 1
May 2020-12-12T22:28:08 android 6 2
June 2021-09-15T20:20:06 Microsoft 9 3
Joy 2021-09-09T09:30:09 android 10 1
May 2021-08-08T05:05:05 apple 8 2
I want to group by id, after which it should look like
Joy 2021-10-10T10:27:16,2021-09-09T09:30:09 apple,android 5,10 1
May 2020-12-12T22:28:08,2021-08-08T05:05:05 android,apple 6,8 2
June 2021-09-15T20:20:06 Microsoft 9 3
After calling another API which converts the id to a pseudo id, I want to map that id so it looks like
Joy 2021-10-10T10:27:16,2021-09-09T09:30:09 apple,android 5,10 1 A12
May 2020-12-12T22:28:08,2021-08-08T05:05:05 android,apple 6,8 2 B23
June 2021-09-15T20:20:06 Microsoft 9 3 C34
I have tried using groupBy and foreach but I am stuck and unable to proceed further.
In order to apply the aggregation you want, you should use collect_set as the aggregation function and concat_ws to join the created arrays with commas:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{collect_set, concat_ws}
import spark.implicits._

val df: DataFrame = Seq(
  ("joy", "2021-10-10T10:27:16", "apple", 5, 1),
  ("may", "2020-12-12T22:28:08", "android", 6, 2),
  ("june", "2021-09-15T20:20:06", "microsoft", 9, 3),
  ("joy", "2021-09-09T09:30:09", "android", 10, 1),
  ("may", "2021-08-08T05:05:05", "apple", 8, 2)
).toDF("name", "timestamp", "platform", "clickcount", "id")

df
  .groupBy("id")
  .agg(
    concat_ws(",", collect_set("timestamp")).as("timestamp"),
    concat_ws(",", collect_set("name")).as("name"),
    concat_ws(",", collect_set("platform")).as("platform"),
    concat_ws(",", collect_set("clickcount")).as("clickcount")
  ).show()
The output should be:
+---+--------------------+----+-------------+----------+
| id| timestamp|name| platform|clickcount|
+---+--------------------+----+-------------+----------+
| 1|2021-10-10T10:27:...| joy|apple,android| 5,10|
| 3| 2021-09-15T20:20:06|june| microsoft| 9|
| 2|2021-08-08T05:05:...| may|apple,android| 6,8|
+---+--------------------+----+-------------+----------+
In order to add a pseudo id column, you should join the created df with another dataframe that contains the conversion values, or write a UDF that receives an id value and converts it into a pseudo id.
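For illustration, a PySpark-flavored sketch of the join variant (the id_map frame standing in for the values returned by the other API is hypothetical, a SparkSession named spark is assumed, and aggregated is the grouped result from the snippet above; the same pattern applies in Scala):
# Hypothetical id -> pseudo id mapping returned by the other API
id_map = spark.createDataFrame([(1, "A12"), (2, "B23"), (3, "C34")], ["id", "pseudo_id"])
# Attach the pseudo id to the grouped rows by joining on id
aggregated.join(id_map, on="id", how="left").show()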

Pyspark - How to concatenate columns of multiple dataframes into columns of one dataframe

I have multiple data frames (24 in total) with one column. I need to combine all of them into a single data frame. I created indexes and joined using the indexes, but it is quite slow to join all of them (all have the same number of rows).
Please note that I'm using Pyspark 2.1
from pyspark.sql import Window
from pyspark.sql.functions import lit, row_number

w = Window().orderBy(lit('A'))
df1 = df1.withColumn('Index', row_number().over(w))
df2 = df2.withColumn('Index', row_number().over(w))
joined_df = df1.join(df2, df1.Index == df2.Index, 'inner').drop(df2.Index)
df3 = df3.withColumn('Index', row_number().over(w))
joined_df = joined_df.join(df3, joined_df.Index == df3.Index).drop(df3.Index)
But as the joined_df grows, it keeps getting slower
DF1:
Col1
2
8
18
12
DF2:
Col2
abc
bcd
def
bbc
DF3:
Col3
1.0
2.2
12.1
1.9
Expected Results:
joined_df:
Col1 Col2 Col3
2 abc 1.0
8 bcd 2.2
18 def 12.1
12 bbc 1.9
You're doing it the correct way. Unfortunately, without a primary key, Spark is not well suited to this type of operation.
Answer by pault, pulled from comment.
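For completeness, the index-join idea from the question can be folded over the whole list of frames in one pass (a sketch only; dfs is assumed to be the Python list of the 24 single-column DataFrames, and the same performance caveat applies):
from functools import reduce
from pyspark.sql import Window
from pyspark.sql.functions import lit, row_number

w = Window().orderBy(lit('A'))

# Attach a positional row index to every frame, then join them all on it
indexed = [df.withColumn('Index', row_number().over(w)) for df in dfs]
joined_df = reduce(lambda left, right: left.join(right, on='Index', how='inner'), indexed)
joined_df = joined_df.drop('Index')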

pyspark: aggregate on the most frequent value in a column

from pyspark.sql.functions import count, sum

aggregrated_table = df_input.groupBy('city', 'income_bracket') \
    .agg(
        count('suburb').alias('suburb'),
        sum('population').alias('population'),
        sum('gross_income').alias('gross_income'),
        sum('no_households').alias('no_households'))
I would like to group by city and income bracket, but within each city certain suburbs have different income brackets. How do I group by the most frequently occurring income bracket per city?
for example:
city1 suburb1 income_bracket_10
city1 suburb1 income_bracket_10
city1 suburb2 income_bracket_10
city1 suburb3 income_bracket_11
city1 suburb4 income_bracket_10
Would be grouped by income_bracket_10
Using a window function before aggregating might do the trick:
from pyspark.sql import Window
import pyspark.sql.functions as psf

w = Window.partitionBy('city')
aggregrated_table = df_input.withColumn(
    "count",
    psf.count("*").over(w)
).withColumn(
    "rn",
    psf.row_number().over(w.orderBy(psf.desc("count")))
).filter("rn = 1").groupBy('city', 'income_bracket').agg(
    psf.count('suburb').alias('suburb'),
    psf.sum('population').alias('population'),
    psf.sum('gross_income').alias('gross_income'),
    psf.sum('no_households').alias('no_households'))
You can also use a window function after aggregating, since you're keeping a count of (city, income_bracket) occurrences.
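One way to read that variant as a sketch (count the (city, income_bracket) pairs first, keep the most frequent bracket per city, then aggregate only those rows; column names follow the question):
from pyspark.sql import Window
import pyspark.sql.functions as psf

w = Window.partitionBy('city').orderBy(psf.desc('n'))
top_bracket = (
    df_input.groupBy('city', 'income_bracket')
    .agg(psf.count('*').alias('n'))
    .withColumn('rn', psf.row_number().over(w))
    .filter('rn = 1')
    .select('city', 'income_bracket')
)
aggregrated_table = (
    df_input.join(top_bracket, ['city', 'income_bracket'])
    .groupBy('city', 'income_bracket')
    .agg(
        psf.count('suburb').alias('suburb'),
        psf.sum('population').alias('population'),
        psf.sum('gross_income').alias('gross_income'),
        psf.sum('no_households').alias('no_households'))
)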
You don't necessarily need Window functions:
import pyspark.sql.functions as F

aggregrated_table = (
    df_input.groupby("city", "suburb", "income_bracket")
    .count()
    .withColumn("count_income", F.array("count", "income_bracket"))
    .groupby("city", "suburb")
    .agg(F.max("count_income").getItem(1).alias("most_common_income_bracket"))
)
I think this does what you require. I don't really know if it performs better than the window based solution.
For pyspark version >=3.4 you can use the mode function directly to get the most frequent element per group:
from pyspark.sql import functions as f
df = spark.createDataFrame([
    ("Java", 2012, 20000), ("dotNET", 2012, 5000),
    ("Java", 2012, 20000), ("dotNET", 2012, 5000),
    ("dotNET", 2013, 48000), ("Java", 2013, 30000)],
    schema=("course", "year", "earnings"))
df.groupby("course").agg(f.mode("year")).show()
+------+----------+
|course|mode(year)|
+------+----------+
|  Java|      2012|
|dotNET|      2012|
+------+----------+
https://github.com/apache/spark/blob/7f1b6fe02bdb2c68d5fb3129684ca0ed2ae5b534/python/pyspark/sql/functions.py#L379
The solution by mfcabrera gave wrong results when F.max was used on the F.array column, as the values in the ArrayType are treated as strings and integer max didn't work as expected.
The solution below worked.
from pyspark.sql import Window
import pyspark.sql.functions as f

w = Window.partitionBy('city', 'suburb').orderBy(f.desc('count'))
aggregrated_table = (
    input_df.groupby("city", "suburb", "income_bracket")
    .count()
    .withColumn("max_income", f.row_number().over(w))
    .filter(f.col("max_income") == 1).drop("max_income")
)
aggregrated_table.display()

Filter out rows with NaN values for certain column

I have a dataset and in some of the rows an attribute value is NaN. This data is loaded into a dataframe and I would like to only use the rows where all attributes have values. I tried doing it via SQL:
val df_data = sqlContext.sql("SELECT * FROM raw_data WHERE attribute1 != NaN")
I tried several variants on this, but I can't seem to get it working.
Another option would be to transform it to an RDD and then filter it, since filtering this dataframe to check whether an attribute isNaN does not work.
I know you accepted the other answer, but you can do it without the explode, which should perform better since the explode doubles your DataFrame size.
Prior to Spark 1.6, you could use a udf like this:
def isNaNudf = udf[Boolean,Double](d => d.isNaN)
df.filter(isNaNudf($"value"))
As of Spark 1.6, you can now use the built-in SQL function isnan() like this:
df.filter(isnan($"value"))
Here is some sample code that shows you my way of doing it -
import sqlContext.implicits._
val df = sc.parallelize(Seq((1, 0.5), (2, Double.NaN))).toDF("id", "value")
val df2 = df.explode[Double, Boolean]("value", "isNaN")(d => Seq(d.isNaN))
df will have -
df.show
id value
1 0.5
2 NaN
while doing filter on df2 will give you what you want -
df2.filter($"isNaN" !== true).show
id value isNaN
1 0.5 false
This works:
where isNaN(tau_doc) = false
e.g.
val df_data = sqlContext.sql("SELECT * FROM raw_data where isNaN(attribute1) = false")