Inverse of pyspark.sql.functions greatest - pyspark

Is there an inverse to the function greatest?
Something to get the min of multiple columns?
If there is not, do you know any other way to find it than using udf functions?
Thank you!

The inverse is:
pyspark.sql.functions.least(*cols)
Returns the least value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.
>>> df = spark.createDataFrame([(1, 4, 3)], ['a', 'b', 'c'])
>>> df.select(least(df.a, df.b, df.c).alias("least")).collect()
[Row(least=1)]
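For intuition, the null-skipping semantics can be sketched in plain Python (just an illustration of the behavior, not Spark code):

```python
def least_py(*vals):
    # mimic least(): ignore nulls (None), return None only if all inputs are null
    non_null = [v for v in vals if v is not None]
    return min(non_null) if non_null else None

print(least_py(1, 4, 3))        # 1
print(least_py(None, 7, None))  # 7
print(least_py(None, None))     # None
```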

Related

Polars searchsorted with a Series

searchsorted is an incredibly useful utility in numpy and pandas for performing a binary search on every element in a list, especially for time-series data.
import numpy as np
np.searchsorted(['a', 'a', 'b', 'c'], ['a', 'b', 'c']) # Returns [0, 2, 3]
np.searchsorted(['a', 'a', 'b', 'c'], ['a', 'b', 'c'], side='right') # Returns [2, 3, 4]
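For reference, the same semantics are available in the standard library's bisect module, which is handy for prototyping the expected results without numpy:

```python
from bisect import bisect_left, bisect_right

data = ['a', 'a', 'b', 'c']
keys = ['a', 'b', 'c']
print([bisect_left(data, k) for k in keys])   # [0, 2, 3], like side='left'
print([bisect_right(data, k) for k in keys])  # [2, 3, 4], like side='right'
```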
I have a few questions about Polars
Is there any way to apply search_sorted on a list in polars in a vectorized manner?
Is there any way to specify side=right for search_sorted?
Can we use non-numeric data in search_sorted?
If answer is no to the questions, what would be the recommended approach / workaround to achieve the functionalities?
(The ideal approach is if search_sorted can be used as part of an expression, e.g. pl.col('A').search_sorted(pl.col('B')))
Here's what I have tried:
import polars as pl
pl.Series(['a', 'a', 'b', 'c']).search_sorted(['a', 'b', 'c']) # PanicException: not implemented for Utf8
pl.Series([0, 0, 1, 2]).search_sorted([0, 1, 2]) # PanicException: dtype List not implemented
list(map(pl.Series([0, 0, 1, 2]).search_sorted, [0, 1, 2])) # Returns [1, 2, 3], different from numpy results
pl.DataFrame({
    'a': [0, 0, 1, 2],
    'b': [0, 1, 2, 3],
}).with_columns([
    pl.col('a').search_sorted(pl.col('b')).alias('c')
])  # Column 'c' is [1, 1, 1, 1], which is incorrect
I understand Polars is still a work in progress and some functionalities are missing, so any help is greatly appreciated!
To extend on @ritchie46's answer, you need a rolling join so that missing values can be joined to their nearest neighbor. Unfortunately, rolling joins don't work on letters (more precisely, Utf8 dtypes), so for your example you have to do an extra step.
Starting from:
df1 = (pl.Series("a", ["a", "a", "b", "c"])
       .set_sorted()
       .to_frame()
       .with_row_count("idx"))
df2 = pl.Series("a", ["a", "b", "c"]).set_sorted().to_frame()
Then we make a df to house all the possible values of a and map each to a numeric index:
dfindx = (pl.DataFrame(pl.concat([df1.get_column('a'), df2.get_column('a')]).unique())
          .sort('a')
          .with_row_count('valindx'))
Now we add that valindx to each of df1 and df2:
df1=df1.join(dfindx, on='a')
df2=df2.join(dfindx, on='a')
To get almost to the finish line you'd do:
df2.join_asof(df1, on='valindx', strategy='forward')
This will leave the last value missing: the 4 from the numpy case. Essentially, the first value 'a' doesn't find a match, but its nearest forward neighbor is a 'b', so it takes that value, and so on; but when it gets to 'c' there is nothing in df1 forward of it, so we need a minor hack of filling that null with the max idx + 1.
(df2
 .join_asof(df1, on='valindx', strategy='forward')
 .with_column(pl.col('idx').fill_null(df1.select(pl.col('idx').max() + 1)[0, 0]))
 .get_column('idx'))
Of course, if you're using times or numerics you can skip the first step. Additionally, I suspect that fetching this index value is just an intermediate step, and the overall process could be done more efficiently without extracting the index values at all, going through join_asof directly.
If you change the strategy of join_asof, that should be largely equivalent to switching the side, but you'd have to change the fill-null hack at the end too.
EDIT: I added the requested functionality and it will be available in next release: https://github.com/pola-rs/polars/pull/6083
Old answer (wrong)
For a "normal" search sorted we can use a join.
# convert to DataFrame
# provide polars with the information the data is sorted (this speeds up many algorithms)
# set a row count
df1 = (pl.Series("a", ["a", "a", "b", "c"])
       .set_sorted()
       .to_frame()
       .with_row_count("idx"))
df2 = pl.Series("a", ["a", "b", "c"]).set_sorted().to_frame()
# join
# drop duplicates
# and only show the indices that were joined
df1.join(df2, on="a", how="semi").unique(subset=["a"])["idx"]
Series: 'idx' [u32]
[
0
2
3
]

How to read csv in pyspark as different types, or map dataset into two different types

Is there a way to map RDD as
covidRDD = sc.textFile("us-states.csv") \
    .map(lambda x: x.split(","))
#reducing states and cases by key
reducedCOVID = covidRDD.reduceByKey(lambda accum, n:accum+n)
print(reducedCOVID.take(1))
The dataset consists of 1 column of states and 1 column of cases. When it's created, it is read as
[[u'Washington', u'1'],...]
Thus, I want to have a column of string and a column of int. I am doing a project on RDDs, so I want to avoid using DataFrames. Any thoughts?
Thanks!
As the dataset contains key-value pairs, use groupByKey and aggregate the counts.
If you have a dataset like [['WH', 10], ['TX', 5], ['WH', 2], ['IL', 5], ['TX', 6]]
The code below gives this output - [('IL', 5), ('TX', 11), ('WH', 12)]
data.groupByKey().map(lambda row: (row[0], sum(row[1]))).collect()
You can also use aggregateByKey with a user-defined function. This method requires 3 parameters: the initial (zero) value, an aggregation function within each partition, and an aggregation function across partitions.
This code also produces the same result as above
def addValues(a, b):
    return a + b

data.aggregateByKey(0, addValues, addValues).collect()
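On the original typing question: the cast to int can happen in the map step itself, before any aggregation. A small parse helper (names hypothetical) that could be passed to map:

```python
def parse_line(line):
    # split a CSV line "state,cases" into a (str, int) pair
    state, cases = line.split(",")
    return (state, int(cases))

print(parse_line("Washington,1"))  # ('Washington', 1)
```

With something like covidRDD = sc.textFile("us-states.csv").map(parse_line), the values become real ints, so reduceByKey(lambda a, b: a + b) sums the counts instead of concatenating strings.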

Using PySpark Imputer on grouped data

I have a Class column which can be 1, 2 or 3, and another column Age with some missing data. I want to Impute the average Age of each Class group.
I want to do something along:
grouped_data = df.groupBy('Class')
imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
imputer.fit(grouped_data)
Is there any workaround to that?
Thanks for your time
Using Imputer, you can filter down the dataset to each Class value, impute the mean, and then join them back, since you know ahead of time what the values can be:
from pyspark.ml.feature import Imputer
from pyspark.sql.functions import col

subsets = []
for i in range(1, 4):
    imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
    subset_df = df.filter(col('Class') == i)
    imputed_subset = imputer.fit(subset_df).transform(subset_df)
    subsets.append(imputed_subset)

# Union them together
# (if you only have 3 subsets, you can do this without a loop)
imputed_df = subsets[0].unionByName(subsets[1]).unionByName(subsets[2])
If you don't know ahead of time what the values are, or if they're not easily iterable, you can groupBy, get the average value for each group as a DataFrame, join that back onto your original DataFrame, and coalesce the missing values:
import pyspark.sql.functions as F
averages = df.groupBy("Class").agg(F.avg("Age").alias("avgAge"))
df_with_avgs = df.join(averages, on="Class")
imputed_df = df_with_avgs.withColumn("imputedAge", F.coalesce("Age", "avgAge"))
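The join-plus-coalesce logic can be checked with a plain-Python sketch of the same idea (toy data, no Spark required):

```python
from collections import defaultdict

rows = [(1, 10.0), (1, None), (2, 30.0), (2, 50.0), (2, None)]  # (Class, Age)

# per-class averages over the non-null ages (what groupBy + avg computes)
totals = defaultdict(lambda: [0.0, 0])
for cls, age in rows:
    if age is not None:
        totals[cls][0] += age
        totals[cls][1] += 1
avgs = {cls: s / n for cls, (s, n) in totals.items()}

# coalesce: keep Age when present, otherwise fall back to the class average
imputed = [(cls, age if age is not None else avgs[cls]) for cls, age in rows]
print(imputed)  # [(1, 10.0), (1, 10.0), (2, 30.0), (2, 50.0), (2, 40.0)]
```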
You need to transform your dataframe with the fitted model, then take the average of the filled data:
from pyspark.ml.feature import Imputer
from pyspark.sql import functions as F

imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
imp_model = imputer.fit(df)
transformed_df = imp_model.transform(df)

transformed_df \
    .groupBy('Class') \
    .agg(F.avg('imputed_Age'))

Pyspark: Passing full dictionary to each task

PySpark: I want to pass my custom dictionary, which contains the distances of several locations, to each task. For each row in my RDD, I need to calculate the distance between its value and every location in the dictionary and take the minimum distance. broadcast didn't solve my problem.
Example:
d = {'a': 3, 'b': 6, 'c': 2}
RDD:
('location1', 5)
('location2', 9)
('location3', 8)
Output:
('location1', 1)
('location2', 3)
('location3', 2)
Please help and thanks
A broadcast variable will definitely solve your problem in this case, though you could also just pass the dictionary (or a list; see below) in your map function. Whether using a broadcast variable is worthwhile depends on the size of the object.
First of all, since all you want is the minimum distance, it looks like you don't care about the keys of the dictionary, just the values. If that list is sorted, it will make it possible to find the minimum distance efficiently.
>>> d = {'a': 3, 'b': 6, 'c': 2}
>>> locations = sorted(d.values())
>>> rdd = sc.parallelize([('location1', 5), ('location2', 9), ('location3', 8)])
Now define a function to find the minimum distance using bisect.bisect. We turn this general function into a function of a single element by using functools.partial to fix the second argument.
>>> from functools import partial
>>> from bisect import bisect
>>> def find_min_distance(loc, locations):
... ind = bisect(locations, loc)
... if ind == len(locations):
... return loc - locations[-1]
... elif ind == 0:
... return locations[0] - loc
... else:
... left_dist = loc - locations[ind - 1]
... right_dist = locations[ind] - loc
... return min(left_dist, right_dist)
>>> mapper = partial(find_min_distance, locations=locations)
>>> rdd.mapValues(mapper).collect()
[('location1', 1), ('location2', 3), ('location3', 2)]
To do this instead with a broadcast variable:
>>> locations_bv = sc.broadcast(locations)
>>> def mapper(loc):
... return find_min_distance(loc, locations_bv.value)
...
>>> rdd.mapValues(mapper).collect()
[('location1', 1), ('location2', 3), ('location3', 2)]

Encoding String to numbers so as to use it in scikit-learn

My data consists of 50 columns and most of them are strings. I have a single multi-class variable which I have to predict. I tried using LabelEncoder in scikit-learn to convert the features (not classes) into whole numbers and feed them as input to the RandomForest model I am using. I am using RandomForest for classification.
Now, when new test data comes in (a stream of new data), for each column, how will I know what the label for each string should be, since using LabelEncoder now will give me a new label independent of the labels I generated before? Am I doing this wrong? Is there anything else I should use for consistent encoding?
The LabelEncoder class has two methods that handle this distinction: fit and transform. Typically you call fit first to map some data to a set of integers:
>>> le = LabelEncoder()
>>> le.fit(['a', 'e', 'b', 'z'])
>>> le.classes_
array(['a', 'b', 'e', 'z'], dtype='<U1')
Once you've fit your encoder, you can transform any data to the label space, without changing the existing mapping:
>>> le.transform(['a', 'e', 'a', 'z', 'a', 'b'])
array([0, 2, 0, 3, 0, 1])
>>> le.transform(['e', 'e', 'e'])
array([2, 2, 2])
The use of this encoder basically assumes that you know beforehand what all the labels are in all of your data. If you have labels that might show up later (e.g., in an online learning scenario), you'll need to decide how to handle those outside the encoder.
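If unseen labels are possible at prediction time, one common workaround is to reserve a code for unknown values. A minimal sketch with a hypothetical dict-based encoder (this class is not part of scikit-learn):

```python
class SafeEncoder:
    """Toy label encoder that maps unseen values to a reserved 'unknown' code."""

    def fit(self, values):
        # assign integer codes in sorted order, like LabelEncoder does
        self.mapping = {v: i for i, v in enumerate(sorted(set(values)))}
        self.unknown = len(self.mapping)  # reserved code for unseen values
        return self

    def transform(self, values):
        return [self.mapping.get(v, self.unknown) for v in values]

enc = SafeEncoder().fit(['a', 'e', 'b', 'z'])
print(enc.transform(['a', 'e', 'q']))  # [0, 2, 4] -- 'q' was never seen
```

The same effect can be had with LabelEncoder by appending a sentinel class (e.g. an empty string) at fit time and mapping unknown inputs to it before calling transform.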
You could save the string -> label mapping for each column in the training data.
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> col_1 = ["paris", "paris", "tokyo", "amsterdam"]
>>> set_col_1 = list(set(col_1))
>>> le.fit(col_1)
>>> dict(zip(set_col_1, le.transform(set_col_1)))
{'amsterdam': 0, 'paris': 1, 'tokyo': 2}
When the testing data comes, you can use those mappings to encode the corresponding columns in the testing data. You do not have to fit the encoder again on the testing data.