PySpark: Passing a full dictionary to each task

PySpark: I want to pass a custom dictionary containing the distances of several locations to each task. For each row in my RDD, I need to compute the distance between that row's value and every location in the dictionary and take the minimum distance. broadcast didn't solve my problem.
Example:
dict = {(a,3),(b,6),(c,2)}
RDD :
(location1, 5)
(location2, 9)
(location3, 8)
Output:
(location1, 1)
(location2, 3)
(location3, 2)
Please help and thanks

A broadcast variable will definitely solve your problem here, though you could also just pass the dictionary (or a list; see below) into your map function. Whether a broadcast variable is worthwhile depends on the size of the object.
First of all, since all you want is the minimum distance, it looks like you don't care about the keys of the dictionary, just the values. Keeping those values in a sorted list makes it possible to find the minimum distance efficiently.
>>> d = {'a': 3, 'b': 6, 'c': 2}
>>> locations = sorted(d.values())
>>> rdd = sc.parallelize([('location1', 5), ('location2', 9), ('location3', 8)])
Now define a function that finds the minimum distance using bisect.bisect, and turn it into a single-argument function with functools.partial by fixing the locations argument.
>>> from functools import partial
>>> from bisect import bisect
>>> def find_min_distance(loc, locations):
...     ind = bisect(locations, loc)
...     if ind == len(locations):
...         return loc - locations[-1]
...     elif ind == 0:
...         return locations[0] - loc
...     else:
...         left_dist = loc - locations[ind - 1]
...         right_dist = locations[ind] - loc
...         return min(left_dist, right_dist)
>>> mapper = partial(find_min_distance, locations=locations)
>>> rdd.mapValues(mapper).collect()
[('location1', 1), ('location2', 3), ('location3', 2)]
To do this instead with a broadcast variable:
>>> locations_bv = sc.broadcast(locations)
>>> def mapper(loc):
...     return find_min_distance(loc, locations_bv.value)
...
>>> rdd.mapValues(mapper).collect()
[('location1', 1), ('location2', 3), ('location3', 2)]
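If the broadcast list is large or long-lived, it can also be worth releasing it once you are done with it. A minimal cleanup sketch using the standard PySpark Broadcast methods:
>>> # Drop the cached copies on the executors (the variable could still be re-broadcast).
>>> locations_bv.unpersist()
>>> # Release the broadcast variable entirely once nothing will use it again.
>>> locations_bv.destroy()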

Related

PySpark Cosine Similarity between two vectors of TF-IDF values using SparseMatrix + koalas or the Pandas API on Spark

I am trying to implement this Name Matching Cosine Similarity approach (the get_matches_df function) in pyspark and pandas_on_spark (koalas) and am struggling to optimize this function. I want to avoid converting the dataframes with toPandas() because that would overload the driver, so I need the function to scale; a batch approach like the example below would work perfectly, as would pandas_udfs or simple UDFs that take one vector and two dataframes:
>>> psdf = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> def pandas_plus(pdf):
...     return pdf[pdf.a > 1]  # allow arbitrary length
...
>>> psdf.pandas_on_spark.apply_batch(pandas_plus)
This is the function I am working on optimizing. Everything else I have already converted (a custom tfidfvectorizer, scaled cosine, a pyspark sparsematrix generator); all that is left to optimize is this part, because it uses loc and I am not sure how it works. I don't mind if it behaves like pandas (i.e. pulls the whole dataframe to the driver), but ideally it would not:
import numpy as np
import pandas as pd

def get_matches_df(sparse_matrix, name_vector, top=100):
    non_zeros = sparse_matrix.nonzero()
    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]
    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size
    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similairity = np.zeros(nr_matches)
    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similairity[index] = sparse_matrix.data[index]
    return pd.DataFrame({'left_side': left_side,
                         'right_side': right_side,
                         'similairity': similairity})
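For what it's worth, a vectorized sketch of that loop (assuming name_vector supports NumPy fancy indexing, e.g. a NumPy array or pandas Series) avoids the Python-level for loop entirely; this is only a local optimization of the function above, not a distributed rewrite:
import numpy as np
import pandas as pd

def get_matches_df_vectorized(sparse_matrix, name_vector, top=100):
    # Same output as get_matches_df above, built with array slicing
    # instead of an explicit Python loop.
    sparserows, sparsecols = sparse_matrix.nonzero()
    nr_matches = top if top else sparsecols.size
    names = np.asarray(name_vector)
    return pd.DataFrame({'left_side': names[sparserows[:nr_matches]],
                         'right_side': names[sparsecols[:nr_matches]],
                         'similairity': sparse_matrix.data[:nr_matches]})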

How to apply a "Gather" operation like numpy's in Caffe2?

I am new to Caffe2, and I want to compose an operation like this:
NumPy way
example code
PyTorch way
example code
My question is: how do I compose Caffe2 operators to achieve the same operation as above? I have tried some compositions but still couldn't find the right one. If anyone knows how to do it, please help; I would really appreciate it.
There is a Gather operator in Caffe2. The main problem with this operator is that you can't set the axis (it's always 0). So, if we run this code:
model = ModelHelper(name="test")
s = np.arange(20).reshape(4, 5)
y = np.asarray([0, 1, 2])
workspace.FeedBlob('s', s.astype(np.float32))
workspace.FeedBlob('y', y.astype(np.int32))
model.net.Gather(['s', 'y'], ['out'])
workspace.RunNetOnce(model.net)
out = workspace.FetchBlob('out')
print(out)
We will get:
[[ 0. 1. 2. 3. 4.]
[ 5. 6. 7. 8. 9.]
[ 10. 11. 12. 13. 14.]]
One solution could be to reshape s to a 1D array and transform y in the same way. First of all, we have to implement an operator to transform y. In this case, we will use a numpy function called ravel_multi_index:
class RavelMultiIndexOp(object):
    def forward(self, inputs, outputs):
        blob_out = outputs[0]
        index = np.ravel_multi_index(inputs[0].data, inputs[1].shape)
        blob_out.reshape(index.shape)
        blob_out.data[...] = index
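To illustrate what that operator computes, here is the same np.ravel_multi_index call in plain NumPy with the inputs used below (a quick standalone check, independent of Caffe2):
import numpy as np

# One row of row-indices and one row of column-indices, raveled against
# the shape of s, which is (4, 5): (0, 0) -> 0, (1, 1) -> 6, (2, 2) -> 12.
y = np.asarray([[0, 1, 2], [0, 1, 2]])
print(np.ravel_multi_index(y, (4, 5)))  # [ 0  6 12]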
Now, we can reimplement our original code:
model = ModelHelper(name="test")
s = np.arange(20).reshape(4, 5)
y = np.asarray([[0, 1, 2],[0, 1, 2]])
workspace.FeedBlob('s', s.astype(np.float32))
workspace.FeedBlob('y', y.astype(np.int32))
model.net.Python(RavelMultiIndexOp().forward)(
    ['y', 's'], ['y'], name='RavelMultiIndex'
)
model.net.Reshape('s', ['s_reshaped', 's_old'], shape=(-1, 1))
model.net.Gather(['s_reshaped', 'y'], ['out'])
workspace.RunNetOnce(model.net)
out = workspace.FetchBlob('out')
print(out)
Output:
[[ 0.]
[ 6.]
[ 12.]]
You may want to reshape it to (1, -1).
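If you want that row shape, one more Reshape can be appended to the net (before calling RunNetOnce), following the same pattern already used for s; the blob names here are just illustrative:
# Reshape the gathered column vector into a single row; 'out_old_shape'
# receives the previous shape, just like 's_old' above.
model.net.Reshape('out', ['out_row', 'out_old_shape'], shape=(1, -1))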

Using PySpark Imputer on grouped data

I have a Class column which can be 1, 2, or 3, and another column Age with some missing data. I want to impute the average Age of each Class group.
I want to do something along the lines of:
grouped_data = df.groupBy('Class')
imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
imputer.fit(grouped_data)
Is there any workaround to that?
Thanks for your time
Using Imputer, you can filter the dataset down to each Class value, impute the mean, and then union the results back together, since you know ahead of time what the values can be:
from pyspark.ml.feature import Imputer
from pyspark.sql.functions import col

subsets = []
for i in range(1, 4):
    imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
    subset_df = df.filter(col('Class') == i)
    imputed_subset = imputer.fit(subset_df).transform(subset_df)
    subsets.append(imputed_subset)

# Union them together
# (if you only have 3 subsets, you can also do this without a loop)
imputed_df = subsets[0].unionByName(subsets[1]).unionByName(subsets[2])
If you don't know ahead of time what the values are, or if they're not easily iterable, you can groupBy, get the average value for each group as a DataFrame, join that back onto your original dataframe, and then coalesce the missing ages with the group average.
import pyspark.sql.functions as F
averages = df.groupBy("Class").agg(F.avg("Age").alias("avgAge"))
df_with_avgs = df.join(averages, on="Class")
imputed_df = df_with_avgs.withColumn("imputedAge", F.coalesce("Age", "avgAge"))
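An equivalent single-pass variant (a sketch, not part of the original answer) uses a window partitioned by Class instead of the explicit join; it performs the same group-mean imputation:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Per-Class average computed over a window, used as a fallback wherever Age is null.
class_avg = F.avg("Age").over(Window.partitionBy("Class"))
imputed_df = df.withColumn("imputedAge", F.coalesce(F.col("Age"), class_avg))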
You need to transform your dataframe with the fitted model, and then take the average of the filled data:
from pyspark.ml.feature import Imputer
from pyspark.sql import functions as F

imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
imp_model = imputer.fit(df)
transformed_df = imp_model.transform(df)

transformed_df \
    .groupBy('Class') \
    .agg(F.avg('Age'))

Inverse of pyspark.sql.functions greatest

Is there an inverse to the function greatest?
Something to get the min of multiple columns?
If there is not, do you know any way to find it other than using udf functions?
Thank you!
The inverse is:
pyspark.sql.functions.least(*cols)
Returns the least value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.
>>> df = spark.createDataFrame([(1, 4, 3)], ['a', 'b', 'c'])
>>> df.select(least(df.a, df.b, df.c).alias("least")).collect()
[Row(least=1)]
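As a quick check of the null-skipping behavior described above (a small sketch, assuming an active SparkSession named spark):
>>> from pyspark.sql.functions import least
>>> df2 = spark.createDataFrame([(1, None, 3)], 'a INT, b INT, c INT')
>>> df2.select(least(df2.a, df2.b, df2.c).alias("least")).collect()
[Row(least=1)]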

find the two highest factors of a single number that are closest to each other

36-> 6*6 (not 9*4)
40-> 5*8 (not 10*4)
35-> 7*5
etc
I'm guessing something like:
candidate = input.square_root.round_to_nearest_int;
while (true){
    test = input/candidate;
    if (test.is_integer) return;
    else
        candidate.decrement;
}
Your approach does work.
If n = ab with a <= b, then a <= sqrt(n) <= b; hence if a and b are chosen so that b - a is minimized, a is the largest divisor of n that is less than or equal to the square root. The only tweak I would make to your pseudocode is to check whether the remainder is zero, rather than checking whether the quotient is an integer. Something like this (in Python):
import math

def closestDivisors(n):
    a = round(math.sqrt(n))
    while n % a > 0:
        a -= 1
    return a, n // a
For example,
>>> closestDivisors(36)
(6, 6)
>>> closestDivisors(40)
(5, 8)
>>> closestDivisors(1000003)
(1, 1000003)
(since the last input is prime).