Applying a function in each row of a big PySpark dataframe? - pyspark

I have a big dataframe (~30M rows). I have a function f. The business of f is to run through each row, check some logics and feed the outputs into a dictionary. The function needs to be performed row by row.
I tried:
dic = dict() for row in df.rdd.collect(): f(row, dic)
But I always meet the error OOM. I set the memory of Docker to 8GB.
How can I effectively perform the business?

from pyspark.sql.functions import udf, struct
from pyspark.sql.types import StringType, MapType
#sample data
df = sc.parallelize([
['a', 'b'],
['c', 'd'],
['e', 'f']
]).toDF(('col1', 'col2'))
#add logic to create dictionary element using rows of the dataframe
def add_to_dict(l):
d = {}
d[l[0]] = l[1]
return d
add_to_dict_udf = udf(add_to_dict, MapType(StringType(), StringType()))
#struct is used to pass rows of dataframe
df = df.withColumn("dictionary_item", add_to_dict_udf(struct([df[x] for x in df.columns])))
df.show()
#list of dictionary elements
dictionary_list = [i[0] for i in df.select('dictionary_item').collect()]
print dictionary_list
Output is:
[{u'a': u'b'}, {u'c': u'd'}, {u'e': u'f'}]

By using collect you pull all the data out of the Spark Executors into your Driver. You really should avoid this, as it makes using Spark pointless (you could just use plain python in that case).
What could you do:
reimplement your logic using functions already available: pyspark.sql.functions doc
if you cannot do the first, because there is functionality missing, you can define a User Defined Function

Related

How can I replace selected columns at one using spark

I am trying to replace many columns at a time using pyspark, I am able to do it using below code, but it is iterating for each column name and when I have 100s of columns it is taking too much time.
from pyspark.sql.functions import col, when
df = sc.parallelize([(1,"foo","val","0","0","can","hello1","buz","oof"),
(2,"bar","check","baz","test","0","pet","stu","got"),
(3,"try","0","pun","0","you","omg","0","baz")]).toDF(["col1","col2","col3","col4","col5","col6","col7","col8","col9"])
df.show()
columns_for_replacement = ['col1','col3','col4','col5','col7','col8','col9']
replace_form = "0"
replace_to = "1"
for i in columns_for_replacement:
df = df.withColumn(i,when((col(i)==replace_form),replace_to).otherwise(col(i)))
df.show()
Can any one suggest how to replace all selected columns at once?

How to get the common value by comparing two pyspark dataframes

I am migrating the pandas dataframe to pyspark. I have two dataframes in pyspark with different counts. The below code I am able to achieve in pandas but not in pyspark. How to compare the 2 dataframes values in pyspark and put the value as new column in df2.
def impute_value (row,df_custom):
for index,row_custom in df_custom.iterrows():
if row_custom["Identifier"] == row["IDENTIFIER"]:
row["NEW_VALUE"] = row_custom['CUSTOM_VALUE']
return row["NEW_VALUE"]
df2['VALUE'] = df2.apply(lambda row: impute_value(row, df_custom),axis =1)
How can I convert this particular function to pyspark dataframe? In pyspark, I cannot pass the row wise value to the function(impute_value).
I tried the following.
df3= df2.join(df_custom, df2["IDENTIFIER"]=df_custom["Identifier"],"left")
df3.WithColumnRenamed("CUSTOM_VALUE","NEW_VALUE")
This is not giving me the result.
the left join itself should do the needful
import pyspark.sql.functions as f
df3= df2.join(df_custom.withColumnRenamed('Identifier','Id'), df2["IDENTIFIER"]=df_custom["Id"],"left")
df3=df3.withColumn('NEW_VALUE',f.col('CUSTOM_VALUE')).drop('CUSTOM_VALUE','Id')

adding new column to my spark dataframe , and calculat the sum()

AttributeError: 'DataFrame' object has no attribute '_get_object_id'
First of all: It is really important that you give us a reproducible example of your dataframe. Nobody likes to look at screenshots to identify an error.
Your code is not working because spark can't determine how the rows of your groupby and your initial dataframe can be merge. It isn't aware of that NUM_TIERS is somekind of a key. Therefore you have to tell spark which column(s) should be used to merge the groupby and the initial dataframe.
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [('OBAAAA7K2KBBO' , 34),
('OBAAAA878000K' , 138 ),
('OBAAAA878A2A0' , 164 ),
('OBAAAA7K2KBBO' , 496),
('OBAAAA878000K' , 91)]
columns = ['NUM_TIERS', 'MONTAN_TR']
df=spark.createDataFrame(l, columns)
You have to options to do that. You can use a join:
df = df.join(df.groupby('NUM_TIERS').sum('MONTAN_TR'), 'NUM_TIERS')
df.show()
OR a window function:
w = Window.partitionBy('NUM_TIERS')
df = df.withColumn('SUM', F.sum('MONTAN_TR').over(w))
Output is the same for both ways:
+-------------+---------+---+
| NUM_TIERS|MONTAN_TR|SUM|
+-------------+---------+---+
|OBAAAA7K2KBBO| 34|530|
|OBAAAA7K2KBBO| 496|530|
|OBAAAA878000K| 138|229|
|OBAAAA878000K| 91|229|
|OBAAAA878A2A0| 164|164|
+-------------+---------+---+

how to use createDataFrame to create a pyspark dataframe?

I know this is probably to be a stupid question. I have the following code:
from pyspark.sql import SparkSession
rows = [1,2,3]
df = SparkSession.createDataFrame(rows)
df.printSchema()
df.show()
But I got an error:
createDataFrame() missing 1 required positional argument: 'data'
I don't understand why this happens because I already supplied 'data', which is the variable rows.
Thanks
You have to create SparkSession instance using the build pattern and use it for creating dataframe, check
https://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession
spark= SparkSession.builder.getOrCreate()
Below are the steps to create pyspark dataframe using createDataFrame
Create sparksession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
Create data and columns
columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
Creating DataFrame from RDD
rdd = spark.sparkContext.parallelize(data)
df= spark.createDataFrame(rdd).toDF(*columns)
the second approach, Directly creating dataframe
df2 = spark.createDataFrame(data).toDF(*columns)
Try
row = [(1,), (2,), (3,)]
?
If I am not wrong createDataFrame() takes 2 lists as input: first list is the data and second list is the column names. The data must be a lists of list of tuples, where each tuple is a row of the dataframe.

Add a column to an existing dataframe with random fixed values in Pyspark

I'm new to Pyspark and I'm trying to add a new column to my existing dataframe. The new column should contain only 4 fixed values (e.g. 1,2,3,4) and I'd like to randomly pick one of the values for each row.
How can I do that?
Pyspark dataframes are immutable, so you have to return a new one (e.g. you can't just assign to it the way you can with Pandas dataframes). To do what you want use a udf:
from pyspark.sql.functions import udf
import numpy as np
df = <original df>
udf_randint = udf(np.random.randint(1, 4))
df_new = df.withColumn("random_num": udf_randint)