Pyspark join two tables based on distance - pyspark

I have two tables, store and weather_station. To find the closest weather station to a particular store and create a new table with those details, I am using the following code.
def closest(weather_stations, store):
    # distance() is a helper, defined elsewhere, that returns the
    # distance between two (lat, lon) points.
    return min(weather_stations,
               key=lambda p: distance(store['lat'], store['lon'], p['lat'], p['lon']))

for store in store_details:
    print(store)
    print(closest(weather_station_details, store))
It works fine without any issue; if I run it with this sample data, I get the correct result.
weather_station_details=[
{'date': '2018-03-06T13:00:00.000Z', 'station_cd': 'CYGK', 'station_nm': 'Kingston', 'lat': 44.22587, 'lon': -76.5966},
{'date': '2018-03-06T13:00:00.000Z', 'station_cd': 'CXOA', 'station_nm': 'OTTAWA CDA RCS', 'lat': 45.383333, 'lon': -75.716667},
{'date': '2018-03-06T13:00:00.000Z', 'station_cd': 'CYUL', 'station_nm': 'Montreal/Trudeau International', 'lat': 45.47046, 'lon': -73.74093},
{'date': '2018-03-06T13:00:00.000Z', 'station_cd': 'CYYC', 'station_nm': 'Calgary International', 'lat': 51.12262, 'lon': -114.01335},
{'date': '2018-03-06T12:00:00.000Z', 'station_cd': 'CPEA', 'station_nm': 'EDGERTON AGCM', 'lat': 52.783333, 'lon': -110.433333},
{'date': '2018-03-06T12:00:00.000Z', 'station_cd': 'CPEH', 'station_nm': 'ENCHANT AGDM', 'lat': 50.183333, 'lon': -112.433333},
{'date': '2018-03-06T12:00:00.000Z', 'station_cd': 'CPGE', 'station_nm': 'GILT EDGE NORTH AGCM', 'lat': 53.066667, 'lon': -110.616667},
{'date': '2018-03-06T12:00:00.000Z', 'station_cd': 'CPHU', 'station_nm': 'HUGHENDEN AGCM AB', 'lat': 52.583333, 'lon': -110.783333},
{'date': '2018-03-06T12:00:00.000Z', 'station_cd': 'CPIR', 'station_nm': 'IRON SPRINGS AGDM', 'lat': 49.9, 'lon': -112.733333},
]
store_details=[
{'lon': -113.99361, 'store_num': 'A111', 'lat': 51.201838},
{'lon': -73.792339, 'store_num': 'A222', 'lat': 45.53343},
{'lon': -75.699475, 'store_num': 'A333', 'lat': 45.475785},
{'lon': -76.564509, 'store_num': 'A444', 'lat': 44.244361},
]
However, the data is huge, so for performance I am trying to use PySpark. I am stuck, though: I can't pass one DataFrame as an argument to a function, and I can't make it global.
Is there any way I can achieve this in PySpark?

There are at least a couple of ways to do this. I'm providing only outlines here.
One approach:
Define a UDF distance to calculate the distance between an arbitrary store and weather station pair.
Perform a cartesian join of your stores dataframe with your weather_stations dataframe. If one of these dataframes is small (a few MB), you could force it to be a broadcast join. (Warning: this will result in a dataframe with a size of M x N, where M and N are the size of the two original dataframes. This could easily blow up your available storage.)
Use your UDF to calculate the distance between each store / station pair.
Either group by store or use a window function partitioned by store to select the weather station with minimum distance (see the sketch below).
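A minimal sketch of this first approach, assuming a SparkSession named spark and DataFrames built from the sample lists in the question; the names stores_df, stations_df, and haversine are illustrative, not from the original code:
from math import radians, sin, cos, asin, sqrt
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
stores_df = spark.createDataFrame(store_details)
stations_df = spark.createDataFrame(weather_station_details)

def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in km between two (lat, lon) points.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

distance_udf = F.udf(haversine, DoubleType())

# Cartesian join; broadcasting the small stations side avoids a shuffle.
pairs = (stores_df.crossJoin(F.broadcast(
             stations_df.withColumnRenamed('lat', 's_lat')
                        .withColumnRenamed('lon', 's_lon')))
         .withColumn('dist', distance_udf('lat', 'lon', 's_lat', 's_lon')))

# Window partitioned by store: keep the row with the minimum distance.
w = Window.partitionBy('store_num').orderBy('dist')
pairs.withColumn('rn', F.row_number().over(w)).filter('rn = 1').drop('rn').show()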
Another approach:
Define a UDF min_distance that takes a store and finds the weather station with the minimum distance. Again, if the list of weather stations is reasonably small, then it would be appropriate to broadcast the weather stations data structure to speed up this step.
Apply this UDF to your stores dataframe (sketched below).
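A sketch of this second approach, reusing the haversine helper and stores_df from the previous sketch; returning station_cd is just one choice of what to carry back:
from pyspark.sql.types import StringType

# Ship the full station list to every executor once.
stations_bc = spark.sparkContext.broadcast(weather_station_details)

@F.udf(StringType())
def min_distance(lat, lon):
    # Scan the broadcast list and return the code of the nearest station.
    nearest = min(stations_bc.value,
                  key=lambda p: haversine(lat, lon, p['lat'], p['lon']))
    return nearest['station_cd']

stores_df.withColumn('closest_station', min_distance('lat', 'lon')).show()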

Related

How to get the list of common friends between any pair of friends in the network using pyspark

I have three records like this:
[('a1', ['b1', 'c1', 'd1', 'e1']), ('a2', ['b1', 'c2', 'd2', 'e1']), ('a3', ['b1', 'c2', 'd1', 'e2'])]
Each record has an id as the key and a list of friends as the value.
I want to get the total number of values in the list for each key in PySpark.
How can I get the list of common friends between any pair of friends in PySpark?
1. Simply use the size function.
df = df.withColumn('num_friends', F.expr('size(friends)'))
2. Use the array_intersect function (available since Spark 2.4) to get the intersection of the arrays.
cp_df = df.toDF('key_pair', 'friends_pair')
cross_df = df.crossJoin(cp_df).filter('key != key_pair')
cross_df = cross_df.select(
    F.create_map('key', 'key_pair').alias('key_pair'),
    F.array_intersect('friends', 'friends_pair').alias('common_friends'))
cross_df.show(truncate=False)
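For completeness, a minimal sketch of how df might be built from the records above; the column names key and friends are assumptions that the snippets rely on:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
records = [('a1', ['b1', 'c1', 'd1', 'e1']),
           ('a2', ['b1', 'c2', 'd2', 'e1']),
           ('a3', ['b1', 'c2', 'd1', 'e2'])]
df = spark.createDataFrame(records, ['key', 'friends'])
df = df.withColumn('num_friends', F.size('friends'))
df.show(truncate=False)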

MongoDB: Get whole document with aggregate method

I'm trying to achieve something like this:
I have a collection of activities that belong to some user.
I want to get the distinct activity names ordered by 'added_time', so I used a group on the activity name and took the max value of 'added_time'.
I also want to sort them by 'added_time' and then get the whole document.
The only thing I have achieved so far is getting only the name I grouped by and the 'added_time' property.
This is the query:
db.getCollection('user_activities').aggregate(
    {$match: {'type': 'food', 'user_id': '123'}},
    {$group: {'_id': '$name', 'added_time': {$max: '$added_time'}}},
    {$sort: {'added_time': -1}},
    {$project: {_id: 0, name: '$_id', count: 1, sum: 1, 'added_time': 1}}
)
Can someone help me with getting the whole document?
Thanks!
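A common pattern for keeping the whole document through a $group is to $sort first, group with $first: '$$ROOT', and then promote the saved document with $replaceRoot (MongoDB 3.4+). A sketch in pymongo, with the connection details made up:
from pymongo import MongoClient

coll = MongoClient().mydb.user_activities  # hypothetical database name

pipeline = [
    {'$match': {'type': 'food', 'user_id': '123'}},
    {'$sort': {'added_time': -1}},
    # After the sort, the first document per name is the newest one.
    {'$group': {'_id': '$name', 'doc': {'$first': '$$ROOT'}}},
    {'$replaceRoot': {'newRoot': '$doc'}},
    {'$sort': {'added_time': -1}},
]
for activity in coll.aggregate(pipeline):
    print(activity)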

Is {key: {$in: ["Only One Element"]}} equivalent to {key: "Only One Element"}

Is the MongoDB query
{key: {$in: ["Only One Element"]}}
equivalent to
{key: "Only One Element"}
in terms of performance?
I wrote a simple test script: https://gist.github.com/nellessen/5f2d36de4ef74b5a34b3
with the following result:
Average milis for $in with index: 0.12271258831
Average milis for simple query with index: 0.114794611931
Average milis for $in without index: 1.51113886833
Average milis for simple query without index: 1.40885763168
So as a result {key: {$in: ["Only One Element"]}} is 7% slower than {key: "Only One Element"}!
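If you want to check how the planner treats the two forms, explain() should show the same index bounds for a single-element $in as for the plain equality, which suggests the measured gap is mostly query-parsing overhead. A pymongo sketch (collection and index assumed):
from pymongo import MongoClient

coll = MongoClient().test.docs  # hypothetical collection
coll.create_index('key')

in_plan = coll.find({'key': {'$in': ['Only One Element']}}).explain()
eq_plan = coll.find({'key': 'Only One Element'}).explain()

# Compare the winning plans and their index bounds side by side.
print(in_plan['queryPlanner']['winningPlan'])
print(eq_plan['queryPlanner']['winningPlan'])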

Indexing two fields in mongo: timestamp and text

I run a lot of find requests on a collection, like this:
{'$and': [
    {'time': {'$lt': 1375214400}},
    {'time': {'$gte': 1375128000}},
    {'$or': [{'uuid': 'test'}, {'uuid': 'test2'}]}
]}
Which index should I create: a compound index, two single-field indexes, or both?
uuid is the name of the data collector; time is a timestamp.
I want to retrieve data collected by one or a few collectors in a specified time interval.
Your query would be better written without the $and and using $in instead of $or:
{
    'time': {'$lt': 1375214400, '$gte': 1375128000},
    'uuid': {'$in': ['test', 'test2']}
}
Then it's pretty clear you need a compound index that covers both time and uuid for best query performance. But it's important to always confirm your index is being used as you expect by using explain().
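A sketch of the suggested index and the explain() check in pymongo; the collection name is made up, and putting the equality field uuid before the range field time follows the usual equality-before-range ordering guideline:
from pymongo import MongoClient

coll = MongoClient().mydb.measurements  # hypothetical collection

# Equality field first, range field second.
coll.create_index([('uuid', 1), ('time', 1)])

cursor = coll.find({
    'time': {'$lt': 1375214400, '$gte': 1375128000},
    'uuid': {'$in': ['test', 'test2']},
})
print(cursor.explain()['queryPlanner']['winningPlan'])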

MongoDB - Pagination based on non-unique fields

I am familiar with the best practice of range-based pagination on large MongoDB collections; however, I am struggling to figure out how to paginate a collection where the sort value is a non-unique field.
For example, I have a large collection of users, and there is a field for the number of times they have done something. This field is definitely non-unique and could have large groups of documents with the same value.
I would like to return results sorted by that 'numTimesDoneSomething' field.
Here is a sample data set:
{_id: ObjectId("50c480d81ff137e805000003"), numTimesDoneSomething: 12}
{_id: ObjectId("50c480d81ff137e805000005"), numTimesDoneSomething: 9}
{_id: ObjectId("50c480d81ff137e805000006"), numTimesDoneSomething: 7}
{_id: ObjectId("50c480d81ff137e805000007"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000002"), numTimesDoneSomething: 15}
{_id: ObjectId("50c480d81ff137e805000008"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000009"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000004"), numTimesDoneSomething: 12}
{_id: ObjectId("50c480d81ff137e805000010"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000011"), numTimesDoneSomething: 1}
How would I return this data set sorted by 'numTimesDoneSomething' with 2 records per page?
@cubbuk shows a good example using offset (skip), but you can also mould that query for ranged pagination:
db.collection.find().sort({numTimesDoneSomething:-1, _id:1})
Since the _id here will be unique and you are using it as a secondary sort key, you can then range by _id, and the results, even between two records having a numTimesDoneSomething of 12, will be consistent as to whether they fall on one page or the next.
So doing something as simple as
var q = db.collection.find({_id: {$gt: last_id}}).sort({numTimesDoneSomething: -1, _id: 1}).limit(2)
should work quite well for ranged pagination (with one caveat, sketched below).
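The caveat: filtering on _id alone is only safe while you stay inside one run of equal numTimesDoneSomething values; a fully general range condition resumes after the (numTimesDoneSomething, _id) pair of the last row on the previous page. A pymongo sketch (collection name assumed):
from pymongo import MongoClient

coll = MongoClient().mydb.users  # hypothetical collection

def next_page(last_num, last_id, page_size=2):
    # Resume strictly after (last_num, last_id) under the sort order
    # (numTimesDoneSomething desc, _id asc).
    query = {'$or': [
        {'numTimesDoneSomething': {'$lt': last_num}},
        {'numTimesDoneSomething': last_num, '_id': {'$gt': last_id}},
    ]}
    return list(coll.find(query)
                    .sort([('numTimesDoneSomething', -1), ('_id', 1)])
                    .limit(page_size))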
You can sort on multiple fields, in this case numTimesDoneSomething and the _id field. Since _id already increases with insertion time, you will be able to paginate through the collection without iterating over duplicate data, unless new data is inserted during the iteration.
db.collection.find().sort({numTimesDoneSomething: -1, _id: 1}).skip(index).limit(2)