Pyspark | map JSON rdd and apply broadcast

In pyspark, how do I transform an input RDD containing JSON into the output specified below, while applying a broadcast variable to the list of values?
Input
[{'id': 1, 'title': "Foo", 'items': ['a','b','c']}, {'id': 2, 'title': "Bar", 'items': ['a','b','d']}]
Broadcast variable
[('a', 5), ('b', 12), ('c', 42), ('d', 29)]
Desired Output
[(1, 'Foo', [5, 12, 42]), (2, 'Bar', [5, 12, 29])]

Edit: Originally I was under the impression that functions passed to map functions are automatically broadcast, but after reading some docs I am no longer sure of that.
In any case, you can define your broadcast variable:
bv = [('a', 5), ('b', 12), ('c', 42), ('d', 29)]
# turn into a dictionary
bv = dict(bv)
broadcastVar = sc.broadcast(bv)
print(broadcastVar.value)
#{'a': 5, 'c': 42, 'b': 12, 'd': 29}
Now it is available on all machines as a read-only variable. You can access the dictionary using broadcastVar.value.
For example:
import json
rdd = sc.parallelize(
    [
        '{"id": 1, "title": "Foo", "items": ["a","b","c"]}',
        '{"id": 2, "title": "Bar", "items": ["a","b","d"]}'
    ]
)
def myMapper(row):
    # define the order of the values for your output
    key_order = ["id", "title", "items"]
    # load the json string into a dict
    d = json.loads(row)
    # replace the items using the broadcast variable dict
    d["items"] = [broadcastVar.value.get(item) for item in d["items"]]
    # return the values in order
    return tuple(d[k] for k in key_order)
print(rdd.map(myMapper).collect())
#[(1, u'Foo', [5, 12, 42]), (2, u'Bar', [5, 12, 29])]
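One detail worth noting: broadcastVar.value.get(item) returns None for any item that is missing from the broadcast dictionary. If you would rather substitute a default value, a small variant of the mapper (the name myMapperWithDefault and the default of 0 are just illustrative) could look like this:
def myMapperWithDefault(row):
    d = json.loads(row)
    # fall back to 0 (an arbitrary choice) instead of None for unknown items
    d["items"] = [broadcastVar.value.get(item, 0) for item in d["items"]]
    return (d["id"], d["title"], d["items"])

print(rdd.map(myMapperWithDefault).collect())
# [(1, 'Foo', [5, 12, 42]), (2, 'Bar', [5, 12, 29])]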

Related

Values of a Dataframe Column into an Array in Scala Spark

Say I have a dataframe:
val df1 = sc.parallelize(List(
  ("A1", 45, "5", 1, 90),
  ("A2", 60, "1", 1, 120),
  ("A3", 45, "9", 1, 450),
  ("A4", 26, "7", 1, 333)
)).toDF("CID", "age", "children", "marketplace_id", "value")
Now I want all the values of the column "children" in a separate array, in the same order.
The code below works for a smaller dataset with only one partition:
val list1 = df1.select("children").map(r => r(0).asInstanceOf[String]).collect()
output:
list1: Array[String] = Array(5, 1, 9, 7)
But the above code no longer preserves the order when the dataframe has multiple partitions:
val partitioned = df1.repartition($"CID")
val list = partitioned.select("children").map(r => r(0).asInstanceOf[String]).collect()
output:
list: Array[String] = Array(9, 1, 7, 5)
Is there a way to get all the values of a column into an array without changing the order?
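One common way to keep the original order is to tag each row with its position before repartitioning and sort by that tag when collecting. Below is a minimal PySpark sketch of the idea, written in Python for consistency with the rest of this page and assuming a DataFrame df1 with the same columns as in the question; the Scala version would use zipWithIndex in the same way:
# tag each row with its original position, then sort by that tag when collecting
indexed = (df1.rdd.zipWithIndex()
              .map(lambda t: tuple(t[0]) + (t[1],))
              .toDF(df1.columns + ["idx"]))
partitioned = indexed.repartition("CID")
children = [r["children"]
            for r in partitioned.select("children", "idx").orderBy("idx").collect()]
# children == ['5', '1', '9', '7']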

Is it possible to combine .agg(dictionary) and renaming the resulting column with .alias() in Pyspark?

I have a pyspark dataframe 'pyspark_df'. I want to group the data and aggregate it with a general function given by its string name, one of the following: 'avg', 'count', 'max', 'mean', 'min', or 'sum'.
I need the resulting aggregated column to be named 'aggregated' regardless of the aggregation type.
I have been able to do this as follows:
seriesname = 'Group'
dateVar = 'as_of_date'
aggSeriesName = 'Balance'
aggType = 'sum'
name_to_be_Changed = aggType + '(' + aggSeriesName + ')'
group_sorted = pyspark_df.groupby(dateVar,seriesname).agg({aggSeriesName: aggType}).withColumnRenamed(name_to_be_Changed,'aggregated').toPandas()
However, is there a way to do this via .alias()? I have seen it used as follows:
group_sorted = pyspark_df.groupby(dateVar,seriesname).agg(sum(aggSeriesName).alias('aggregated')).toPandas()
How do I use alias in a way that I don't have to type out the 'sum(aggSeriesName)' portion? Hopefully I am being clear.
I'm not sure of the motivation behind the question, so I can't offer a tailored alternative, but as far as I know it is not possible to combine .agg(dictionary) with renaming the resulting column via .alias(); withColumnRenamed is the way to go for this case.
What you can also do is apply a selectExpr:
vertices = sqlContext.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
    ("d", "David", 29),
    ("e", "Esther", 32),
    ("f", "Fanny", 36),
    ("g", "Gabby", 60)], ["id", "name", "age"])
aggSeriesName = 'age'
aggType = 'sum'
targetName = 'aggregated'
bla = vertices.selectExpr('{}({}) as {}'.format(aggType, aggSeriesName, targetName))
bla.show()
Output:
+----------+
|aggregated|
+----------+
| 257|
+----------+
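If you want to keep the groupby but still avoid typing out 'sum(aggSeriesName)' by hand, one possible sketch (reusing the question's variables) is to look the aggregation function up by its string name on pyspark.sql.functions with getattr and alias the result:
from pyspark.sql import functions as F

seriesname = 'Group'
dateVar = 'as_of_date'
aggSeriesName = 'Balance'
aggType = 'sum'

# look up F.sum / F.avg / F.min / ... by its string name, then alias the result
aggExpr = getattr(F, aggType)(aggSeriesName).alias('aggregated')
group_sorted = pyspark_df.groupby(dateVar, seriesname).agg(aggExpr).toPandas()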

Dropping rows from a spark dataframe based on a condition

I want to drop rows from a spark dataframe of lists based on a condition: the length of the list in each row.
I have tried converting it into a list of lists and then using a for loop (demonstrated below), but I'm hoping to do it in one statement within Spark, creating a new immutable df from the original df based on this condition.
newList = df2.values.tolist()
finalList = []
for subList in newList:
    if len(subList) < 4:
        finalList.append(subList)
So, for instance, if the dataframe is a one-column dataframe and the column is named sequences, it looks like:
sequences
____________
[1, 2, 4]
[1, 6, 3]
[9, 1, 4, 6]
I want to drop all rows where the length of the list is more than 3, resulting in:
sequences
____________
[1, 2, 4]
[1, 6, 3]
Here is one approach for Spark >= 1.5, using the built-in size function:
from pyspark.sql import Row
from pyspark.sql.functions import size
df = spark.createDataFrame([
    Row(a=[9, 3, 4], b=[8, 9, 10]),
    Row(a=[7, 2, 6, 4], b=[2, 1, 5]),
    Row(a=[7, 2, 4], b=[8, 2, 1, 5]),
    Row(a=[2, 4], b=[8, 2, 10, 12, 20])
])
df.where(size(df['a']) <= 3).show()
Output:
+---------+------------------+
| a| b|
+---------+------------------+
|[9, 3, 4]| [8, 9, 10]|
|[7, 2, 4]| [8, 2, 1, 5]|
| [2, 4]|[8, 2, 10, 12, 20]|
+---------+------------------+
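Applied to the single-column dataframe from the question (column sequences, keeping only lists of length at most 3), the same idea would look roughly like this:
from pyspark.sql.functions import size

df2 = spark.createDataFrame([([1, 2, 4],), ([1, 6, 3],), ([9, 1, 4, 6],)], ["sequences"])
df2.where(size("sequences") <= 3).show()
# +---------+
# |sequences|
# +---------+
# |[1, 2, 4]|
# |[1, 6, 3]|
# +---------+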

Pyspark | Transform RDD from key with list of values > values with list of keys

In pyspark, how do I transform an input RDD where every key has a list of values into an output RDD where every value has a list of the keys it belongs to?
Input
[(1, ['a','b','c','e']), (2, ['b','d']), (3, ['a','d']), (4, ['b','c'])]
Output
[('a', [1, 3]), ('b', [1, 2, 4]), ('c', [1, 4]), ('d', [2,3]), ('e', [1])]
Flatten and swap the keys and values of the RDD first, and then groupByKey:
rdd.flatMap(lambda r: [(k, r[0]) for k in r[1]]).groupByKey().mapValues(list).collect()
# [('a', [1, 3]), ('e', [1]), ('b', [1, 2, 4]), ('c', [1, 4]), ('d', [2, 3])]
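If the data is already (or more naturally kept as) a DataFrame, the same transposition can be sketched with explode and collect_list; note that collect_list makes no guarantee about the order of the keys inside each list:
from pyspark.sql import functions as F

df = rdd.toDF(["key", "values"])
result = (df.select("key", F.explode("values").alias("value"))
            .groupBy("value")
            .agg(F.collect_list("key").alias("keys")))
result.collect()
# e.g. [Row(value='a', keys=[1, 3]), Row(value='b', keys=[1, 2, 4]), ...]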

pyspark sortBy didn't work on multiple values?

Suppose I have an RDD containing 4-tuples (a, b, c, d), where a, b, c, and d are all integer variables.
I'm trying to sort the data in ascending order based only on the d variable (but the result did not come out as expected, so I tried something else).
This is the current code I typed:
sortedRDD = RDD.sortBy(lambda (a, b, c, d): d)
However, when I check the final data, it seems that the result is still not correct.
# I check with this code
sortedRDD.takeOrdered(15)
You should specify the sorting order again in takeOrdered:
RDD.takeOrdered(15, lambda (a, b, c, d): d)
As you do not collect the data right after the sort, the order is not guaranteed in subsequent operations; see the example below:
rdd = sc.parallelize([(1,2,3,4),(5,6,7,3),(9,10,11,2),(12,13,14,1)])
result = rdd.sortBy(lambda (a,b,c,d): d)
#when using collect output is sorted
#[(12, 13, 14, 1), (9, 10, 11, 2), (5, 6, 7, 3), (1, 2, 3, 4)]
result.collect()
#results are not necessarily sorted in subsequent operations
#[(1, 2, 3, 4), (5, 6, 7, 3), (9, 10, 11, 2), (12, 13, 14, 1)]
result.takeOrdered(5)
#result are sorted when specifying the sort order in takeOrdered
#[(12, 13, 14, 1), (9, 10, 11, 2), (5, 6, 7, 3), (1, 2, 3, 4)]
result.takeOrdered(5,lambda (a,b,c,d): d)
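One more note: the lambda (a, b, c, d): d style of unpacking a tuple inside a lambda only works on Python 2 (tuple parameter unpacking was removed in Python 3). On Python 3 you would index into the tuple instead, for example:
# Python 3 equivalents of the tuple-unpacking lambdas above
result = rdd.sortBy(lambda t: t[3])
result.takeOrdered(5, lambda t: t[3])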