Adding a new column to a PySpark dataframe based on another column

I would like to add a new column to a dataframe based on another column using WHEN. I have the following code:
from pyspark.sql.functions import col, expr, when
df2=df.withColumn("test1",when(col("Country")=="DE","EUR").when(col("Country")=="PL","PLN").otherweise("Unknown"))
but I get the error:
'Column' object is not callable
How can I fix the problem?

You have a typo in your statement: change otherweise to otherwise.
df = spark.createDataFrame([("DE",), ("PL",), ("PO",)], ["Country"])
df.withColumn("test1", when(col("Country") == "DE", "EUR").when(col("Country") == "PL", "PLN").otherwise("Unknown")).show()
#+-------+-------+
#|Country|  test1|
#+-------+-------+
#|     DE|    EUR|
#|     PL|    PLN|
#|     PO|Unknown|
#+-------+-------+
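Equivalently, the same mapping can be written as a SQL CASE expression via expr (just an alternative sketch, using the same df as above):
from pyspark.sql.functions import expr
# Same logic as the when/otherwise chain, expressed as a Spark SQL CASE WHEN
df.withColumn("test1", expr("CASE WHEN Country = 'DE' THEN 'EUR' WHEN Country = 'PL' THEN 'PLN' ELSE 'Unknown' END")).show()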

Related

Spark bad records: bad records show the reason for only one column

I am trying to filter out bad records from a CSV file using PySpark. The code snippet is given below:
from pyspark.sql import SparkSession
schema="employee_id int,name string,address string,dept_id int"
spark = SparkSession.builder.appName("TestApp").getOrCreate()
data = (spark.read.format("csv")
        .option("header", True)
        .schema(schema)
        .option("badRecordsPath", "/tmp/bad_records")
        .load("/path/to/csv/file"))
schema_for_bad_record="path string,record string,reason string"
bad_records_frame=spark.read.schema(schema_for_bad_record).json("/tmp/bad_records")
bad_records_frame.select("reason").show()
The valid dataframe is
+-----------+-------+-------+-------+
|employee_id|   name|address|dept_id|
+-----------+-------+-------+-------+
|       1001|  Floyd|  Delhi|      1|
|       1002| Watson| Mumbai|      2|
|       1004|Thomson|Chennai|      3|
|       1005| Bonila|   Pune|      4|
+-----------+-------+-------+-------+
In one of the records, both employee_id and dept_id have incorrect values, but the reason shows only one column's issue.
java.lang.NumberFormatException: For input string: "abc"
Is there any way to show reasons for multiple columns in case of failure?
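One possible workaround (a sketch, not taken from an existing answer, assuming the same file and integer columns as above) is to read every column as a string and attempt the casts yourself, so that failures in several columns can be collected for the same row:
from pyspark.sql import functions as F
# Read the file without a schema so every column arrives as a string
raw = (spark.read.format("csv")
       .option("header", True)
       .load("/path/to/csv/file"))
# For each integer column, build a reason when the value is present but does not cast
int_cols = ["employee_id", "dept_id"]
reasons = [
    F.when(F.col(c).isNotNull() & F.col(c).cast("int").isNull(),
           F.concat(F.lit(c + ": cannot cast '"), F.col(c), F.lit("' to int")))
    for c in int_cols
]
# concat_ws skips nulls, so only the failing columns end up in "reasons"
bad = raw.withColumn("reasons", F.concat_ws("; ", *reasons)).filter(F.col("reasons") != "")
bad.show(truncate=False)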

How do I pass a column to the substr function in PySpark?

I have 2 columns in a dataframe, ValueText and GLength. I need to add a new column VX based on the other 2 columns (ValueText and GLength). Basically, the new column VX is based on a substring of ValueText. Below is what I tried:
df_stage1.withColumn("VX", df_stage1.ValueText.substr(6,df_stage1.GLength))
However, with the above code, I get the error: startPos and length must be the same type. Got class 'int' and class 'pyspark.sql.column.Column', respectively.
I have also tried
func.expr("substring(ValueText,5, 5 + GLength)")
When I execute the above code, I get the error: Pyspark job aborted due to stage failure
expr will work in this case, as it lets us reference Glength inside the substring function.
Example:
df = spark.createDataFrame([("abcdff", 4), ("dlaldajfa", 3)], ["valuetext", "Glength"])
df.show()
#+---------+-------+
#|valuetext|Glength|
#+---------+-------+
#|   abcdff|      4|
#|dlaldajfa|      3|
#+---------+-------+
from pyspark.sql.functions import *
df.withColumn("vx",expr("substring(valuetext,0,Glength)")).show()
#+---------+-------+----+
#|valuetext|Glength|  vx|
#+---------+-------+----+
#|   abcdff|      4|abcd|
#|dlaldajfa|      3| dla|
#+---------+-------+----+
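For the original requirement (a substring starting at a fixed position, with the length taken from GLength), the same expr approach should carry over; a sketch using the column names from the question:
# Sketch only: start at position 6 and take GLength characters of ValueText
df_stage1.withColumn("VX", expr("substring(ValueText, 6, GLength)"))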

How to select a column based on the value of another in PySpark?

I have a dataframe where some column special_column contains values like one, two. My dataframe also has columns one_processed and two_processed.
I would like to add a new column my_new_column whose values are taken from other columns of my dataframe, based on the value of special_column. For example, if special_column == one I would like my_new_column to be set to one_processed.
I tried .withColumn("my_new_column", F.col(F.concat(F.col("special_column"), F.lit("_processed")))), but Spark complains that I cannot parametrize F.col with a column.
How could I get the string value of the concatenation, so that I can select the desired column?
from pyspark.sql.functions import when, col, lit, concat_ws

sdf.withColumn(
    "my_new_column",
    when(col("special_column") == "one", col("one_processed"))
    .otherwise(concat_ws("_", col("special_column"), lit("processed")))
)
The easiest way in your case would be just a simple when/otherwise, like:
>>> import pyspark.sql.functions as F
>>> df = spark.createDataFrame([(1, 2, "one"), (1, 2, "two")], ["one_processed", "two_processed", "special_column"])
>>> df.withColumn("my_new_column", F.when(F.col("special_column") == "one", F.col("one_processed")).otherwise(F.col("two_processed"))).show()
+-------------+-------------+--------------+-------------+
|one_processed|two_processed|special_column|my_new_column|
+-------------+-------------+--------------+-------------+
|            1|            2|           one|            1|
|            1|            2|           two|            2|
+-------------+-------------+--------------+-------------+
As far as I know there is no way to select a column whose name is computed from the data, since the execution plan would then depend on the data.
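If there are many *_processed columns, the same when chain can be built in a loop instead of being spelled out by hand (a sketch, assuming the <name>_processed naming convention from the question):
import pyspark.sql.functions as F
# Hypothetical list of the values that special_column can take
names = ["one", "two"]
# Build when(...).when(...) dynamically, one branch per candidate column
mapping = F.when(F.col("special_column") == names[0], F.col(names[0] + "_processed"))
for n in names[1:]:
    mapping = mapping.when(F.col("special_column") == n, F.col(n + "_processed"))
df.withColumn("my_new_column", mapping).show()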

Convert an int column to list type in PySpark

My DataFrame has a column num_of_items. It is a count field. Now I want to convert it from int type to list type.
I tried using array(col) and even creating a function to return a list by taking an int value as input. Neither worked.
from pyspark.sql.types import ArrayType
from array import array

def to_array(x):
    return [x]

df = df.withColumn("num_of_items", monotonically_increasing_id())
df
col_1 | num_of_items
A     | 1
B     | 2
Expected output:
col_1 | num_of_items
A     | [23]
B     | [43]
I tried using array(col)
Using pyspark.sql.functions.array seems to work for me.
from pyspark.sql.functions import array
df.withColumn("num_of_items", array("num_of_items")).show()
#+-----+------------+
#|col_1|num_of_items|
#+-----+------------+
#|    A|         [1]|
#|    B|         [2]|
#+-----+------------+
and even creating a function to return a list by taking an int value as input.
If you want to use the function you created, you have to make it a udf and specify the return type:
from pyspark.sql.types import ArrayType, IntegerType
from pyspark.sql.functions import udf, col
to_array_udf = udf(to_array, ArrayType(IntegerType()))
df.withColumn("num_of_items", to_array_udf(col("num_of_items"))).show()
#+-----+------------+
#|col_1|num_of_items|
#+-----+------------+
#|    A|         [1]|
#|    B|         [2]|
#+-----+------------+
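If you do end up needing the udf, the decorator form is an equivalent way to register the same function (a small sketch):
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

@udf(returnType=ArrayType(IntegerType()))
def to_array(x):
    return [x]

df.withColumn("num_of_items", to_array("num_of_items")).show()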
But it's preferable to avoid using udfs when possible: See Spark functions vs UDF performance?

PySpark - Add map function as column

I have a pyspark DataFrame
a = [
    ('Bob', 562),
    ('Bob', 880),
    ('Bob', 380),
    ('Sue', 85),
    ('Sue', 963)
]
df = spark.createDataFrame(a, ["Person", "Amount"])
I need to create a column that hashes the Amount and returns the hashed amount. The problem is I can't use a UDF, so I have used a mapping function.
df.rdd.map(lambda x: hash(x["Amount"]))
If you can't use a udf, you can use the map function, but as you've currently written it there will only be one column. To keep all the columns, do the following:
df = df.rdd\
    .map(lambda x: (x["Person"], x["Amount"], hash(str(x["Amount"]))))\
    .toDF(["Person", "Amount", "Hash"])
df.show()
#+------+------+--------------------+
#|Person|Amount|                Hash|
#+------+------+--------------------+
#|   Bob|   562|-4340709941618811062|
#|   Bob|   880|-7718876479167384701|
#|   Bob|   380|-2088598916611095344|
#|   Sue|    85|    7168043064064671|
#|   Sue|   963|-8844931991662242457|
#+------+------+--------------------+
Note: In this case, hash(x["Amount"]) is not very interesting so I changed it to hash Amount converted to a string.
Essentially you have to map the row to a tuple containing all of the existing columns and add in the new column(s).
If your columns are too many to enumerate, you could also just add a tuple to the existing row.
df = df.rdd\
    .map(lambda x: x + (hash(str(x["Amount"])),))\
    .toDF(df.columns + ["Hash"])
I should also point out that if hashing the values is your end goal, there is also a pyspark function pyspark.sql.functions.hash that can be used to avoid the serialization to rdd:
import pyspark.sql.functions as f
df.withColumn("Hash", f.hash("Amount")).show()
#+------+------+----------+
#|Person|Amount|      Hash|
#+------+------+----------+
#|   Bob|   562|  51343841|
#|   Bob|   880|1241753636|
#|   Bob|   380| 514174926|
#|   Sue|    85|1944150283|
#|   Sue|   963|1665082423|
#+------+------+----------+
This uses a different hashing algorithm (Spark's Murmur3-based hash) than the Python builtin, so the values will not match Python's hash().
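pyspark.sql.functions.hash also accepts several columns at once, if a combined hash over multiple fields is needed (a small sketch):
# Hash the combination of Person and Amount rather than Amount alone
df.withColumn("RowHash", f.hash("Person", "Amount")).show()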