adding new column to my spark dataframe , and calculat the sum() - pyspark

AttributeError: 'DataFrame' object has no attribute '_get_object_id'

First of all: It is really important that you give us a reproducible example of your dataframe. Nobody likes to look at screenshots to identify an error.
Your code is not working because spark can't determine how the rows of your groupby and your initial dataframe can be merge. It isn't aware of that NUM_TIERS is somekind of a key. Therefore you have to tell spark which column(s) should be used to merge the groupby and the initial dataframe.
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [('OBAAAA7K2KBBO' , 34),
('OBAAAA878000K' , 138 ),
('OBAAAA878A2A0' , 164 ),
('OBAAAA7K2KBBO' , 496),
('OBAAAA878000K' , 91)]
columns = ['NUM_TIERS', 'MONTAN_TR']
df=spark.createDataFrame(l, columns)
You have to options to do that. You can use a join:
df = df.join(df.groupby('NUM_TIERS').sum('MONTAN_TR'), 'NUM_TIERS')
df.show()
OR a window function:
w = Window.partitionBy('NUM_TIERS')
df = df.withColumn('SUM', F.sum('MONTAN_TR').over(w))
Output is the same for both ways:
+-------------+---------+---+
| NUM_TIERS|MONTAN_TR|SUM|
+-------------+---------+---+
|OBAAAA7K2KBBO| 34|530|
|OBAAAA7K2KBBO| 496|530|
|OBAAAA878000K| 138|229|
|OBAAAA878000K| 91|229|
|OBAAAA878A2A0| 164|164|
+-------------+---------+---+

Related

How to get the common value by comparing two pyspark dataframes

I am migrating the pandas dataframe to pyspark. I have two dataframes in pyspark with different counts. The below code I am able to achieve in pandas but not in pyspark. How to compare the 2 dataframes values in pyspark and put the value as new column in df2.
def impute_value (row,df_custom):
for index,row_custom in df_custom.iterrows():
if row_custom["Identifier"] == row["IDENTIFIER"]:
row["NEW_VALUE"] = row_custom['CUSTOM_VALUE']
return row["NEW_VALUE"]
df2['VALUE'] = df2.apply(lambda row: impute_value(row, df_custom),axis =1)
How can I convert this particular function to pyspark dataframe? In pyspark, I cannot pass the row wise value to the function(impute_value).
I tried the following.
df3= df2.join(df_custom, df2["IDENTIFIER"]=df_custom["Identifier"],"left")
df3.WithColumnRenamed("CUSTOM_VALUE","NEW_VALUE")
This is not giving me the result.
the left join itself should do the needful
import pyspark.sql.functions as f
df3= df2.join(df_custom.withColumnRenamed('Identifier','Id'), df2["IDENTIFIER"]=df_custom["Id"],"left")
df3=df3.withColumn('NEW_VALUE',f.col('CUSTOM_VALUE')).drop('CUSTOM_VALUE','Id')

Applying a function in each row of a big PySpark dataframe?

I have a big dataframe (~30M rows). I have a function f. The business of f is to run through each row, check some logics and feed the outputs into a dictionary. The function needs to be performed row by row.
I tried:
dic = dict() for row in df.rdd.collect(): f(row, dic)
But I always meet the error OOM. I set the memory of Docker to 8GB.
How can I effectively perform the business?
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import StringType, MapType
#sample data
df = sc.parallelize([
['a', 'b'],
['c', 'd'],
['e', 'f']
]).toDF(('col1', 'col2'))
#add logic to create dictionary element using rows of the dataframe
def add_to_dict(l):
d = {}
d[l[0]] = l[1]
return d
add_to_dict_udf = udf(add_to_dict, MapType(StringType(), StringType()))
#struct is used to pass rows of dataframe
df = df.withColumn("dictionary_item", add_to_dict_udf(struct([df[x] for x in df.columns])))
df.show()
#list of dictionary elements
dictionary_list = [i[0] for i in df.select('dictionary_item').collect()]
print dictionary_list
Output is:
[{u'a': u'b'}, {u'c': u'd'}, {u'e': u'f'}]
By using collect you pull all the data out of the Spark Executors into your Driver. You really should avoid this, as it makes using Spark pointless (you could just use plain python in that case).
What could you do:
reimplement your logic using functions already available: pyspark.sql.functions doc
if you cannot do the first, because there is functionality missing, you can define a User Defined Function

Creating a new column by applying a function in an existing column in PySpark?

Say I have a dataframe
product_id customers
1 [1,2,4]
2 [1,2]
I want to create a new column, say nb_customer by applying the function len on the column customers.
I tried
df = df.select('*', (map(len, df.customers)).alias('nb_customer'))
but it does not work.
What is the correct way to do that?
import pyspark.sql.functions as f
df = sc.parallelize([
[1,[1,2,4]],
[2,[1,2]]
]).toDF(('product_id', 'customers'))
df.withColumn('nb_customer',f.size(df.customers)).show()

Add a column to an existing dataframe with random fixed values in Pyspark

I'm new to Pyspark and I'm trying to add a new column to my existing dataframe. The new column should contain only 4 fixed values (e.g. 1,2,3,4) and I'd like to randomly pick one of the values for each row.
How can I do that?
Pyspark dataframes are immutable, so you have to return a new one (e.g. you can't just assign to it the way you can with Pandas dataframes). To do what you want use a udf:
from pyspark.sql.functions import udf
import numpy as np
df = <original df>
udf_randint = udf(np.random.randint(1, 4))
df_new = df.withColumn("random_num": udf_randint)

How to sum the values of one column of a dataframe in spark/scala

I have a Dataframe that I read from a CSV file with many columns like: timestamp, steps, heartrate etc.
I want to sum the values of each column, for instance the total number of steps on "steps" column.
As far as I see I want to use these kind of functions:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
But I can understand how to use the function sum.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Do I use the function sum wrongly?
Do Ι need to use first the function map? and if yes how?
A simple example would be very helpful! I started writing Scala recently.
You must first import the functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
Edit:
For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
Edit2:
For dynamically applying the aggregations, the following options are available:
Applying to all numeric columns at once:
df.groupBy().sum()
Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
If you want to sum all values of one column, it's more efficient to use DataFrame's internal RDD and reduce.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(10,2,3,4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)
//res1 Int = 19
Simply apply aggregation function, Sum on your column
df.groupby('steps').sum().show()
Follow the Documentation http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
Check out this link also https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
Not sure this was around when this question was asked but:
df.describe().show("columnName")
gives mean, count, stdtev stats on a column. I think it returns on all columns if you just do .show()
Using spark sql query..just incase if it helps anyone!
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
import java.util.stream.Collectors
val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF()
df.createOrReplaceTempView("steps")
val sum = spark.sql("select sum(steps) as stepsSum from steps").map(row => row.getAs("stepsSum").asInstanceOf[Long]).collect()(0)
println("steps sum = " + sum) //prints 28