pyspark: withColumn AnalysisException after column is converted to lowercase

I have a dataframe that looks like below
+------------+------+
| food|pounds|
+------------+------+
| bacon| 4.0|
|STRAWBERRIES| 3.5|
| Bacon| 7.0|
|STRAWBERRIES| 3.0|
| BACON| 6.0|
|strawberries| 9.0|
|Strawberries| 1.0|
| pecans| 3.0|
+------------+------+
And the expected output is
+------------+------+---------+
| food|pounds|food_type|
+------------+------+---------+
| bacon| 4.0| meat|
|STRAWBERRIES| 3.5| fruit|
| Bacon| 7.0| meat|
|STRAWBERRIES| 3.0| fruit|
| BACON| 6.0| meat|
|strawberries| 9.0| fruit|
|Strawberries| 1.0| fruit|
| pecans| 3.0| other|
+------------+------+---------+
So I essentially defined new_column based on my logic and applied it with .withColumn:
new_column = when((col('food') == 'bacon') | (col('food') == 'BACON') | (col('food') == 'Bacon'), 'meat'
).when((col('food') == 'STRAWBERRIES') | (col('food') == 'strawberries') | (col('food') == 'Strawberries'), 'fruit'
).otherwise('other')
And then
df.withColumn("food_type", new_column).show()
Which works fine. But I wanted to express new_column with less code, so I rewrote it as below:
new_column = when(lower(col('food') == 'bacon') , 'meat'
).when(lower(col('food') == 'strawberries'), 'fruit'
).otherwise('other')
Now when I do df.withColumn("food_type", new_column).show()
I get the error:
AnalysisException: "cannot resolve 'CASE WHEN lower(CAST((`food` = 'bacon') AS STRING)) THEN 'meat' WHEN lower(CAST((`food` = 'strawberries') AS STRING)) THEN 'fruit' ELSE 'other' END' due to data type mismatch: WHEN expressions in CaseWhen should all be boolean type, but the 1th when expression's type is lower(cast((food#165 = bacon) as string));;\n'Project [food#165, pounds#166, CASE WHEN lower(cast((food#165 = bacon) as string)) THEN meat WHEN lower(cast((food#165 = strawberries) as string)) THEN fruit ELSE other END AS food_type#197]\n+- Relation[food#165,pounds#166] csv\n"
What am I missing?

Your parentheses are misplaced: lower() is being applied to the result of the comparison (a boolean) instead of to the column, which is why Spark complains that the WHEN expression is not of boolean type.
new_column = when(lower(col('food')) == 'bacon' , 'meat').when(lower(col('food')) == 'strawberries', 'fruit').otherwise('other')
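For reference, a minimal end-to-end sketch of the corrected expression against a few of the sample rows from the question (just a verification sketch, not part of the original answer):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower, when

spark = SparkSession.builder.getOrCreate()

# Sample rows taken from the question
df = spark.createDataFrame(
    [("bacon", 4.0), ("STRAWBERRIES", 3.5), ("Bacon", 7.0), ("pecans", 3.0)],
    ["food", "pounds"])

new_column = (when(lower(col("food")) == "bacon", "meat")
              .when(lower(col("food")) == "strawberries", "fruit")
              .otherwise("other"))

df.withColumn("food_type", new_column).show()
# bacon -> meat, STRAWBERRIES -> fruit, Bacon -> meat, pecans -> other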

I would like to share another approach, which is closer to plain SQL and can also be more appropriate for more complex and nested conditions.
from pyspark.sql.functions import *
cond = """case when lower(food) in ('bacon') then 'meat'
else case when lower(food) in ('strawberries') then 'fruit'
else 'other'
end
end"""
newdf = df.withColumn("food_type", expr(cond))
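Note that the nested CASE is not strictly needed here; a single CASE with multiple WHEN branches expresses the same logic (a small sketch using the same column name):
cond = """case when lower(food) in ('bacon') then 'meat'
               when lower(food) in ('strawberries') then 'fruit'
               else 'other'
          end"""
newdf = df.withColumn("food_type", expr(cond))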
Hope it helps.
Regards,
Neeraj

simplified:
new_column=when(lower(col("food"))=="bacon",'meat').when(lower(col("food"))=='strawberries','fruit').otherwise('other')
df.withColumn("food_type", new_column).show()

Related

How to access values from an array column in a Scala dataframe

I have a dataframe containing a Scala array of (index, value) tuples like the following; the index has values from 1 to 4:
id | units_flag_tuples
id1 | [(3,2.0), (4,6.0)]
id2 | [(1,10.0), (2,2.0), (3,5.0)]
I would like to access the value from the array and put it into columns based on index (unit1, unit2, unit3, unit4):
ID | unit1 | unit2 | unit3 | unit4
id1 | null | null | 2.0 | 6.0
id2 | 10.0 | 2.0 | 5.0 | null
here is the code:
df
.withColumn("unit1", col("units_flag_tuples").find(_._1 == '1').get._2 )
.withColumn("unit2", col("units_flag_tuples").find(_._1 == '2').get._2 )
.withColumn("unit3", col("units_flag_tuples").find(_._1 == '3').get._2 )
.withColumn("unit4", col("units_flag_tuples").find(_._1 == '4').get._2 )
Here is the error message I am getting:
error: value find is not a member of org.apache.spark.sql.Column
How do I resolve this error, or is there a better way to do it?
Here is a different approach: I used the map_from_entries function to turn the array into a map, and then get each column by looking up its key in the map.
val df = Seq(("id1", Seq((3,2.0), (4,6.0))), ("id2", Seq((1,10.0), (2,2.0), (3,5.0)))).toDF("id", "units_flag_tuples")
df.show(false)
df.withColumn("map", map_from_entries(col("units_flag_tuples")))
.withColumn("unit1", col("map.1"))
.withColumn("unit2", col("map.2"))
.withColumn("unit3", col("map.3"))
.withColumn("unit4", col("map.4"))
.drop("map", "units_flag_tuples").show
The result is:
+---+-----+-----+-----+-----+
| id|unit1|unit2|unit3|unit4|
+---+-----+-----+-----+-----+
|id1| null| null| 2.0| 6.0|
|id2| 10.0| 2.0| 5.0| null|
+---+-----+-----+-----+-----+
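For a pyspark reader, the same map_from_entries idea looks roughly like this (a sketch, assuming Spark 2.4+ where map_from_entries is available):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, map_from_entries

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("id1", [(3, 2.0), (4, 6.0)]), ("id2", [(1, 10.0), (2, 2.0), (3, 5.0)])],
    "id string, units_flag_tuples array<struct<index:int,value:double>>")

# Build a map from the (index, value) structs, then look up each index as a column
result = (df
    .withColumn("map", map_from_entries(col("units_flag_tuples")))
    .select("id", *[col("map").getItem(i).alias(f"unit{i}") for i in range(1, 5)]))
result.show()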

How to check whether a whole column in pyspark contains a value using expr

In pyspark, how can I use expr to check whether a whole column contains the value in columnA of that row?
Pseudo code below:
df=df.withColumn("Result", expr(if any the rows in column1 contains the value colA (for this row) then 1 else 0))
Take an arbitrary example:
valuesCol = [('rose','rose is red'),('jasmine','I never saw Jasmine'),('lily','Lili dont be silly'),('daffodil','what a flower')]
df = sqlContext.createDataFrame(valuesCol,['columnA','columnB'])
df.show()
+--------+-------------------+
| columnA| columnB|
+--------+-------------------+
| rose| rose is red|
| jasmine|I never saw Jasmine|
| lily| Lili dont be silly|
|daffodil| what a flower|
+--------+-------------------+
This is an application of expr(). To use expr(), just look up the corresponding SQL syntax; it mostly works as-is inside expr().
df = df.withColumn('columnA_exists',expr("(case when instr(lower(columnB), lower(columnA))>=1 then 1 else 0 end)"))
df.show()
+--------+-------------------+--------------+
| columnA| columnB|columnA_exists|
+--------+-------------------+--------------+
| rose| rose is red| 1|
| jasmine|I never saw Jasmine| 1|
| lily| Lili dont be silly| 0|
|daffodil| what a flower| 0|
+--------+-------------------+--------------+
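The same check can also be written without expr(), using the Column API (a sketch; Column.contains accepts another column):
from pyspark.sql import functions as F

df = df.withColumn(
    "columnA_exists",
    F.when(F.lower(F.col("columnB")).contains(F.lower(F.col("columnA"))), 1).otherwise(0))
df.show()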

Map a multimap to columns of dataframe

Simply, I want to convert a multimap like this:
val input = Map("rownum"-> List("1", "2", "3") , "plant"-> List( "Melfi", "Pomigliano", "Torino" ), "tipo"-> List("gomme", "telaio")).toArray
in the following Spark dataframe:
+-------+--------------+-------+
|rownum | plant | tipo |
+------ +--------------+-------+
| 1 | Melfi | gomme |
| 2 | Pomigliano | telaio|
| 3 | Torino | null |
+-------+--------------+-------+
replacing missing values with null. My issue is applying a map function to the RDD:
val inputRdd = sc.parallelize(input)
inputRdd.map(..).toDF()
Any suggestions? Thanks in advance
Although, as I noted in my comments, I'm really not sure the multimap format is well suited to your problem (did you have a look at the Spark XML parsing modules?).
The pivot table solution
The idea is to flatten your input table into an (elementPosition, columnName, columnValue) format:
// The max size of the multimap lists
val numberOfRows = input.map(_._2.size).max
// For each index in the list, emit a tuple of (index, multimap key, multimap value at index)
val flatRows = (0 until numberOfRows).flatMap(rowIdx => input.map({ case (colName, allColValues) => (rowIdx, colName, if(allColValues.size > rowIdx) allColValues(rowIdx) else null)}))
// Probably faster at runtime to write it this way (less iterations) :
// val flatRows = input.flatMap({ case (colName, existingValues) => (0 until numberOfRows).zipAll(existingValues, null, null).map(t => (t._1.asInstanceOf[Int], colName, t._2)) })
// To dataframe
val flatDF = sc.parallelize(flatRows).toDF("elementIndex", "colName", "colValue")
flatDF.show
Will output:
+------------+-------+----------+
|elementIndex|colName| colValue|
+------------+-------+----------+
| 0| rownum| 1|
| 0| plant| Melfi|
| 0| tipo| gomme|
| 1| rownum| 2|
| 1| plant|Pomigliano|
| 1| tipo| telaio|
| 2| rownum| 3|
| 2| plant| Torino|
| 2| tipo| null|
+------------+-------+----------+
Now this is a pivot table problem:
flatDF.groupBy("elementIndex").pivot("colName").agg(expr("first(colValue)")).drop("elementIndex").show
+----------+------+------+
| plant|rownum| tipo|
+----------+------+------+
|Pomigliano| 2|telaio|
| Torino| 3| null|
| Melfi| 1| gomme|
+----------+------+------+
This might not be the best-looking solution, but it is fully scalable to any number of columns.
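For pyspark readers, the same flatten-then-pivot idea looks roughly like this (a sketch; the names mirror the Scala above and an active SparkSession named spark is assumed):
from pyspark.sql import functions as F

input_map = {"rownum": ["1", "2", "3"],
             "plant": ["Melfi", "Pomigliano", "Torino"],
             "tipo": ["gomme", "telaio"]}

# Flatten to (elementIndex, colName, colValue), padding the shorter lists with None
n_rows = max(len(v) for v in input_map.values())
flat_rows = [(i, k, v[i] if i < len(v) else None)
             for i in range(n_rows) for k, v in input_map.items()]

flat_df = spark.createDataFrame(flat_rows, ["elementIndex", "colName", "colValue"])
(flat_df.groupBy("elementIndex")
        .pivot("colName")
        .agg(F.first("colValue"))
        .drop("elementIndex")
        .show())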

How to get the last row from DataFrame?

I have a DataFrame with two columns, 'value' and 'timestamp'. The 'timestamp' column is ordered, and I want to get the last row of the DataFrame. What should I do?
this is my input:
+-----+---------+
|value|timestamp|
+-----+---------+
| 1| 1|
| 4| 2|
| 3| 3|
| 2| 4|
| 5| 5|
| 7| 6|
| 3| 7|
| 5| 8|
| 4| 9|
| 18| 10|
+-----+---------+
this is my code:
val arr = Array((1,1),(4,2),(3,3),(2,4),(5,5),(7,6),(3,7),(5,8),(4,9),(18,10))
var df=m_sparkCtx.parallelize(arr).toDF("value","timestamp")
this is my expected result:
+-----+---------+
|value|timestamp|
+-----+---------+
| 18| 10|
+-----+---------+
Try this, it works for me.
df.orderBy($"value".desc).show(1)
I would simply use a query that
- orders your table in descending order
- takes the 1st value from this order
df.createOrReplaceTempView("table_df")
query_latest_rec = """SELECT * FROM table_df ORDER BY value DESC limit 1"""
latest_rec = sqlContext.sql(query_latest_rec)
latest_rec.show()
I'd simply reduce:
df.reduce { (x, y) =>
  if (x.getAs[Int]("timestamp") > y.getAs[Int]("timestamp")) x else y
}
The most efficient way is to reduce your DataFrame. This gives you a single row which you can convert back to a DataFrame, but as it contains only 1 record, this does not make much sense.
sparkContext.parallelize(
  Seq(
    df.reduce {
      (a, b) => if (a.getAs[Int]("timestamp") > b.getAs[Int]("timestamp")) a else b
    } match { case Row(value: Int, timestamp: Int) => (value, timestamp) }
  )
)
  .toDF("value", "timestamp")
  .show
+-----+---------+
|value|timestamp|
+-----+---------+
| 18| 10|
+-----+---------+
Less efficient (as it requires a shuffle), although shorter, is this solution:
df
.where($"timestamp" === df.groupBy().agg(max($"timestamp")).map(_.getInt(0)).collect.head)
If your timestamp column is unique and in increasing order, then the following are ways to get the last row:
println(df.sort($"timestamp".desc).first())
// Output [18,10]
df.sort($"timestamp".desc).take(1).foreach(println)
// Output [18,10]
df.where($"timestamp" === df.count()).show
Output:
+-----+---------+
|value|timestamp|
+-----+---------+
| 18| 10|
+-----+---------+
If not, create a new column with the index and select the last index as below:
val df1 = spark.sqlContext.createDataFrame(
  df.rdd.zipWithIndex.map {
    case (row, index) => Row.fromSeq(row.toSeq :+ index)
  },
  StructType(df.schema.fields :+ StructField("index", LongType, false)))
df1.where($"index" === df.count() - 1).drop("index").show
Output:
+-----+---------+
|value|timestamp|
+-----+---------+
| 18| 10|
+-----+---------+
Java:
Dataset<Row> sortDF = inputDF.orderBy(org.apache.spark.sql.functions.col(config.getIncrementingColumn()).desc());
Row row = sortDF.first();
You can also use the desc function: Column desc(String columnName)
df.orderBy(desc("value")).show(1)
which gives same result as
df.orderBy($"value".desc).show(1)

PySpark DataFrame Manipulation Efficiency

Suppose I have the following data frame :
+----------+-----+----+-------+
|display_id|ad_id|prob|clicked|
+----------+-----+----+-------+
| 123| 989| 0.9| 0|
| 123| 990| 0.8| 1|
| 123| 999| 0.7| 0|
| 234| 789| 0.9| 0|
| 234| 777| 0.7| 0|
| 234| 769| 0.6| 1|
| 234| 798| 0.5| 0|
+----------+-----+----+-------+
I then perform the following operations to get a final data set (shown below the code) :
# Add a new column with the clicked ad_id if clicked == 1, 0 otherwise
df_adClicked = df.withColumn("ad_id_clicked", when(df.clicked==1, df.ad_id).otherwise(0))
# DF -> RDD with tuple : (display_id, (ad_id, prob), clicked)
df_blah = df_adClicked.rdd.map(lambda x : (x[0],(x[1],x[2]),x[4])).toDF(["display_id", "ad_id","clicked_ad_id"])
# Group by display_id and create column with clicked ad_id and list of tuples : (ad_id, prob)
df_blah2 = df_blah.groupby('display_id').agg(F.collect_list('ad_id'), F.max('clicked_ad_id'))
# Define function to sort list of tuples by prob and create list of only ad_ids
def sortByRank(ad_id_list):
    sortedVersion = sorted(ad_id_list, key=itemgetter(1), reverse=True)
    sortedIds = [i[0] for i in sortedVersion]
    return sortedIds
# Sort the (ad_id, prob) tuples by using udf/function and create new column ad_id_sorted
sort_ad_id = udf(lambda x : sortByRank(x), ArrayType(IntegerType()))
df_blah3 = df_blah2.withColumn('ad_id_sorted', sort_ad_id('collect_list(ad_id)'))
# Function to change clickedAdId into an array of size 1
def createClickedSet(clickedAdId):
    setOfDocs = [clickedAdId]
    return setOfDocs
clicked_set = udf(lambda y : createClickedSet(y), ArrayType(IntegerType()))
df_blah4 = df_blah3.withColumn('ad_id_set', clicked_set('max(clicked_ad_id)'))
# Select the necessary columns
finalDF = df_blah4.select('display_id', 'ad_id_sorted','ad_id_set')
+----------+--------------------+---------+
|display_id|ad_id_sorted |ad_id_set|
+----------+--------------------+---------+
|234 |[789, 777, 769, 798]|[769] |
|123 |[989, 990, 999] |[990] |
+----------+--------------------+---------+
Is there a more efficient way of doing this? This set of transformations seems to be the bottleneck in my code. I would greatly appreciate any feedback.
I haven't done any timing comparisons, but I would think that by not using any UDFs, Spark should be able to optimize things well on its own.
#scala: val dfad = sc.parallelize(Seq((123,989,0.9,0),(123,990,0.8,1),(123,999,0.7,0),(234,789,0.9,0),(234,777,0.7,0),(234,769,0.6,1),(234,798,0.5,0))).toDF("display_id","ad_id","prob","clicked")
#^^^that's^^^ the only difference (besides putting val in front of variables) between this python response and a Scala one
dfad = sc.parallelize(((123,989,0.9,0),(123,990,0.8,1),(123,999,0.7,0),(234,789,0.9,0),(234,777,0.7,0),(234,769,0.6,1),(234,798,0.5,0))).toDF(["display_id","ad_id","prob","clicked"])
dfad.registerTempTable("df_ad")
df1 = sqlContext.sql("SELECT display_id,collect_list(ad_id) ad_id_sorted FROM (SELECT * FROM df_ad SORT BY display_id,prob DESC) x GROUP BY display_id")
+----------+--------------------+
|display_id| ad_id_sorted|
+----------+--------------------+
| 234|[789, 777, 769, 798]|
| 123| [989, 990, 999]|
+----------+--------------------+
df2 = sqlContext.sql("SELECT display_id, max(ad_id) as ad_id_set from df_ad where clicked=1 group by display_id")
+----------+---------+
|display_id|ad_id_set|
+----------+---------+
| 234| 769|
| 123| 990|
+----------+---------+
final_df = df1.join(df2,"display_id")
+----------+--------------------+---------+
|display_id| ad_id_sorted|ad_id_set|
+----------+--------------------+---------+
| 234|[789, 777, 769, 798]| 769|
| 123| [989, 990, 999]| 990|
+----------+--------------------+---------+
I didn't put the ad_id_set into an Array because you were calculating the max and max should only return 1 value. I'm sure if you really need it in an array you can make that happen.
I included the subtle Scala difference in case someone using Scala has a similar problem in the future.
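For comparison, the same join-based idea can also be written with the DataFrame API instead of SQL strings. A sketch (not benchmarked here), using sort_array over (prob, ad_id) structs to make the per-group ordering explicit:
from pyspark.sql import functions as F

# Sorted ad_ids per display_id: sort (prob, ad_id) structs descending, then extract ad_id
ranked = (dfad.groupBy("display_id")
              .agg(F.sort_array(F.collect_list(F.struct("prob", "ad_id")), asc=False)
                   .alias("ranked")))
df1 = ranked.select("display_id", F.col("ranked.ad_id").alias("ad_id_sorted"))

# Clicked ad per display_id
df2 = (dfad.where("clicked = 1")
           .groupBy("display_id")
           .agg(F.max("ad_id").alias("ad_id_set")))

final_df = df1.join(df2, "display_id")
final_df.show()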