Pyspark: How do I create a column in PySpark with this condition?

If df.acc = "1" and df.bcc = "1" then "Normal"
If df.acc = "1" and df.bcc = "2" then "Critical"
If df.acc = "1" and df.bcc = "3" then "AOG"
If df.acc = "4" and df.bcc = "1" then "Warranty"
If df.acc = "4" and df.bcc = "2" then "Routine"
If df.acc = "4" and df.bcc = "3" then "Contract"

Given a DataFrame like this:
+---+---+
|acc|bcc|
+---+---+
| 1| 1|
| 1| 2|
| 1| 3|
| 4| 1|
| 4| 2|
| 4| 3|
+---+---+
You can create an additional column based on your conditions like so. Note that "Other" is the value entered into the column if none of the conditions are met.
from pyspark.sql import functions
df = (df.withColumn("new_column_name",
        functions.when((df["acc"] == "1") & (df["bcc"] == "1"), "Normal")
        .when((df["acc"] == "1") & (df["bcc"] == "2"), "Critical")
        .when((df["acc"] == "1") & (df["bcc"] == "3"), "AOG")
        .when((df["acc"] == "4") & (df["bcc"] == "1"), "Warranty")
        .when((df["acc"] == "4") & (df["bcc"] == "2"), "Routine")
        .when((df["acc"] == "4") & (df["bcc"] == "3"), "Contract")
        .otherwise("Other")))

Related

Creation of a column based on filtered values of other column in pyspark

I am trying to create a new column called k whose value depends on whether metric is I or M; otherwise I want to return an empty value.
Thanks in advance for your answer :)
data = [["1", "Amit", "DU", "I", "8", "6"],
["2", "Mohit", "DU", "I", "4", "2"],
["3", "rohith", "BHU", "I", "5", "3"],
["4", "sridevi", "LPU", "I", "1", "6"],
["1", "sravan", "KLMP", "M", "2", "4"],
["5", "gnanesh", "IIT", "M", "6", "8"],
["6", "gnadesh", "KLM", "c", "10", "9"]]
columns = ['ID', 'NAME', 'college', 'metric', 'x', 'y']
dataframe = spark.createDataFrame(data, columns)
+---+-------+-------+------+---+---+
| ID| NAME|college|metric| x| y|
+---+-------+-------+------+---+---+
| 1| Amit| DU| I| 8| 6|
| 2| Mohit| DU| I| 4| 2|
| 3| rohith| BHU| I| 5| 3|
| 4|sridevi| LPU| I| 1| 6|
| 1| sravan| KLMP| M| 2| 4|
| 5|gnanesh| IIT| M| 6| 8|
| 6|gnadesh| KLM| c| 10| 9|
+---+-------+-------+------+---+---+
I tried this, but it does not work:
dataframe= dataframe.withColumn('k', when ((col('metric') == 'M',(dataframe['metric'] / 10)))
.when ((col('metric') == 'I',(dataframe['metric'] / 10 * 2,54)))
.otherwise (' '))
from pyspark.sql.functions import lit
dataframe= dataframe.withColumn('k', when ((col('metric') == 'M',(dataframe['metric'] / 10)))
.when ((col('metric') == 'I',(dataframe['metric'] / 10 * 2,54)))
.otherwise (lit(' ')))
Or
from pyspark.sql.functions import lit
dataframe= dataframe.withColumn('k', when ((col('metric') == 'M',(dataframe['metric'] / 10)))
.when ((col('metric') == 'I',(dataframe['metric'] / 10 * 2,54)))
.otherwise (lit(None)))
I am guessing you're getting the error in the otherwise part of the code: the second argument of DataFrame.withColumn must be of type Column, which ' ' isn't.
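For reference, a corrected sketch of the attempt above: when() must receive the condition and the value as two separate arguments, and 2,54 has to be written as 2.54. Note that dividing the string column metric by 10 would only yield nulls, so the numeric column x is assumed here as the intended operand:
from pyspark.sql.functions import col, lit, when

# when(condition, value) takes two arguments; the decimal is 2.54, not 2,54.
# 'x' is assumed to be the numeric column the division was meant for.
dataframe = dataframe.withColumn(
    "k",
    when(col("metric") == "M", col("x") / 10)
    .when(col("metric") == "I", col("x") / 10 * 2.54)
    .otherwise(lit(None)),
)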

If-If statement Scala Spark

I have a dataframe for which I have to create a new column based on values in the existing columns. The catch is that I can't write a CASE statement, because a CASE checks the first WHEN condition and only moves on to the next WHEN if it is not satisfied, whereas I need every matching condition to produce a row. E.g. consider this dataframe:
+-+-----+-+
|A|B |C|
+-+-----+-+
|1|true |1|-----> Conditions 1 and 2 are satisfied here
|1|true |0|-----> Condition 1 is satisfied here
|1|false|1|
|2|true |1|
|2|true |0|
+-+-----+-+
Consider this CASE statement:
CASE WHEN A = 1 and B = 'true' then 'A'
WHEN A = 1 and B = 'true' and C=1 then 'B'
END
It gives me no row for value B.
Expected output:
+-+-----+-+----+
|A|B |C|D |
+-+-----+-+----+
|1|true |1|A |
|1|true |1|B |
|1|true |0|A |
|1|false|1|null|
|2|true |1|null|
|2|true |0|null|
+-+-----+-+----+
I know I can derive this with 2 separate dataframes and then union them, but I am looking for a more efficient solution.
Creating the dataframe:
val df1 = Seq((1, true, 1), (1, true, 0), (1, false, 1), (2, true, 1), (2, true, 0)).toDF("A", "B", "C")
df1.show()
// +---+-----+---+
// | A| B| C|
// +---+-----+---+
// | 1| true| 1|
// | 1| true| 0|
// | 1|false| 1|
// | 2| true| 1|
// | 2| true| 0|
// +---+-----+---+
The code: build an array of the when results, trim the trailing null when only the first condition matched, then explode so a row that satisfies both conditions produces two rows.
import org.apache.spark.sql.functions._

val condition1 = ($"A" === 1) && ($"B" === true)
val condition2 = condition1 && ($"C" === 1)
val arr1 = array(when(condition1, "A"), when(condition2, "B"))
val arr2 = when(element_at(arr1, 2).isNull, slice(arr1, 1, 1)).otherwise(arr1)
val df2 = df1.withColumn("D", explode(arr2))
df2.show()
// +---+-----+---+----+
// | A| B| C| D|
// +---+-----+---+----+
// | 1| true| 1| A|
// | 1| true| 1| B|
// | 1| true| 0| A|
// | 1|false| 1|null|
// | 2| true| 1|null|
// | 2| true| 0|null|
// +---+-----+---+----+
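For anyone who needs this in PySpark, a rough equivalent sketch of the same array-and-explode idea (assuming a DataFrame df1 with columns A, B, C as above):
from pyspark.sql import functions as F

# Collect the when() results in an array, drop the trailing null when only
# the first condition matched, then explode so a row matching both
# conditions yields two output rows.
cond1 = (F.col("A") == 1) & (F.col("B") == F.lit(True))
cond2 = cond1 & (F.col("C") == 1)

arr1 = F.array(F.when(cond1, "A"), F.when(cond2, "B"))
arr2 = F.when(F.element_at(arr1, 2).isNull(), F.slice(arr1, 1, 1)).otherwise(arr1)

df2 = df1.withColumn("D", F.explode(arr2))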

Fill blank rows in a column with a non-blank value above it in Spark

I have an input file with around 8.5+ million records.
My requirement is to fill the empty values in a column with the nearest non-blank value above them. Have a look at the example:
+-----+-----+---+------+
|FName|LName|Age|Gender|
+-----+-----+---+------+
| A| B| 29| M|
| A| C| 12| |
| B| D| 35| |
| Q| D| 85| F|
| W| R| 14| |
+-----+-----+---+------+
Desired Output:
+-----+-----+---+------+
|FName|LName|Age|Gender|
+-----+-----+---+------+
| A| B| 29| M|
| A| C| 12| M|
| B| D| 35| M|
| Q| D| 85| F|
| W| R| 14| F|
+-----+-----+---+------+
An increment column can be added, and the function last with ignoreNulls can be used over a window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val idWindow = Window.orderBy($"id")

df
  .withColumn("id", monotonically_increasing_id())
  .withColumn("Gender",
    last(
      when($"Gender" === "", null).otherwise($"Gender"),
      ignoreNulls = true).over(idWindow))
  .drop("id")
Add a rowId column, create a Gender_temp column that keeps non-blank values and marks blank rows as M or F based on whether rowId is even or odd, then drop the unused columns and rename Gender_temp back to Gender:
import org.apache.spark.sql.functions._

object DataframeFill {
  def main(args: Array[String]): Unit = {
    // Constant.getSparkSess is the answerer's own helper that returns a SparkSession.
    val spark = Constant.getSparkSess
    import spark.implicits._

    val personDF = Seq(("A", "B", 29, "M"),
      ("A", "C", 12, ""),
      ("B", "D", 35, "F"),
      ("Q", "D", 85, ""),
      ("W", "R", 14, "")).toDF("FName", "LName", "Age", "Gender")
    personDF.show()

    personDF
      .withColumn("rowId", monotonically_increasing_id())
      .withColumn("Gender_temp", when($"Gender".isin(""),
        when($"rowId" % 2 === 0, "M").otherwise("F")).otherwise($"Gender"))
      .drop("Gender")
      .drop("rowId")
      .withColumnRenamed("Gender_temp", "Gender")
      .show()
  }
}

Update an existing column in pyspark without changing older values

I am trying to update an existing column in PySpark, but it seems the old values in the column are being overwritten as well, despite there being no otherwise condition.
+-----+-----+-----+-----+-----+----+
|cntry|cde_1|cde_2|rsn_1|rsn_2|FLAG|
+-----+-----+-----+-----+-----+----+
| MY| A| | 1| 2|null|
| MY| G| | 1| 2|null|
| MY| | G| 1| 2|null|
| TH| A| | 16| 2|null|
| TH| B| | 1| 16| 1|
| TH| | W| 16| 2| 1|
+-----+-----+-----+-----+-----+----+
df = sc.parallelize([ ["MY","A","","1","2"], ["MY","G","","1","2"], ["MY","","G","1","2"], ["TH","A","","16","2"], ["TH","B","","1","16"], ["TH","","W","16","2"] ]).toDF(("cntry", "cde_1", "cde_2", "rsn_1", "rsn_2"))
df = df.withColumn('FLAG', F.when( (df.cntry == "MY") & ( (df.cde_1.isin("G") ) | (df.cde_2.isin("G") ) ) & ( (df.rsn_1 == "1") | (df.rsn_2 == "1") ) , 1))
df = df.withColumn('FLAG', F.when( (df.cntry == "TH") & ( (df.cde_1.isin("B", "W") ) | (df.cde_2.isin("B", "W") ) ) & ( (df.rsn_1 == "16") | (df.rsn_2 == "16") ) , 1))
You need to combine your conditions using a Boolean OR, like this:
from pyspark.sql import functions as F
df = sc.parallelize([ ["MY","A","","1","2"], ["MY","G","","1","2"], ["MY","","G","1","2"], ["TH","A","","16","2"], ["TH","B","","1","16"], ["TH","","W","16","2"] ]).toDF(("cntry", "cde_1", "cde_2", "rsn_1", "rsn_2"))
cond1 = (df.cntry == "MY") & ( (df.cde_1.isin("G") ) | (df.cde_2.isin("G") ) ) & ( (df.rsn_1 == "1") | (df.rsn_2 == "1") )
cond2 = (df.cntry == "TH") & ( (df.cde_1.isin("B", "W") ) | (df.cde_2.isin("B", "W") ) ) & ( (df.rsn_1 == "16") | (df.rsn_2 == "16") )
df.withColumn("FLAG", F.when(cond1 | cond2, 1)).show()
In your last line, you overwrite the FLAG column, as you’re not referencing its previous state. That’s why the previous values are not preserved.
Instead of combining the expressions, you could also use when(cond1, 1).otherwise(when(cond2, 1)). That’s a stylistic choice.
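If the goal is to apply the second condition without wiping out values set by the first, a minimal sketch is to carry the existing column through the otherwise branch:
from pyspark.sql import functions as F

# The second assignment keeps whatever FLAG already holds when cond2 does
# not match, instead of silently replacing it with null.
df = df.withColumn("FLAG", F.when(cond1, 1))
df = df.withColumn("FLAG", F.when(cond2, 1).otherwise(F.col("FLAG")))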

Replace spaces with null values using regexp_replace if I have multiple columns

How do I replace spaces with null if I have spaces in multiple columns?
Input dataset:
+---+-----+-----+
| Id|col_1|col_2|
+---+-----+-----+
|  0|104  |     |
|  1|     |     |
+---+-----+-----+
import org.apache.spark.sql.functions._
val test = df.withColumn("col_1","col_2", regexp_replace(df("col_1","col_1"), "^\\s*", lit(Null)))
test.filter("col_1,col_2 is null").show()
Output Dataset:
+---+-----+-----+
| Id|col_1|col_2|
+---+-----+-----+
|  0|104  | Null|
|  1| Null| Null|
+---+-----+-----+
Use one withColumn for each column:
import org.apache.spark.sql.functions._
val df = List(("0", "104", " "), ("1", " ", "")).toDF("Id","col_1", "col_2")
val test = df
.withColumn("col_1", when(regexp_replace (col("col_1"), "\\s+", "") === "", null).otherwise(col("col_1")))
.withColumn("col_2", when(regexp_replace (col("col_2"), "\\s+", "") === "", null).otherwise(col("col_2")))
.show
Result
+---+-----+-----+
| Id|col_1|col_2|
+---+-----+-----+
| 0| 104| null|
| 1| null| null|
+---+-----+-----+
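For reference, a rough PySpark equivalent of the same one-withColumn-per-column approach (column names assumed from the example above):
from pyspark.sql import functions as F

# Set a column to null when it contains only whitespace, one column at a time.
for c in ["col_1", "col_2"]:
    df = df.withColumn(
        c, F.when(F.trim(F.col(c)) == "", F.lit(None)).otherwise(F.col(c))
    )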
Hi, you can do it like this:
scala> val someDFWithName = Seq((1, "anurag", ""), (5, "", "")).toDF("id", "name", "age")
someDFWithName: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> someDFWithName.show
+---+------+---+
| id| name|age|
+---+------+---+
| 1|anurag| |
| 5| | |
+---+------+---+
scala> someDFWithName.na.replace(Seq("name","age"),Map(""-> null)).show
+---+------+----+
| id| name| age|
+---+------+----+
| 1|anurag|null|
| 5| null|null|
+---+------+----+
Or try this as well:
scala> someDFWithName.withColumn("Name", when(col("Name") === "", null).otherwise(col("Name"))).withColumn("Age", when(col("Age") === "", null).otherwise(col("Age"))).show
+---+------+----+
| id| name| age|
+---+------+----+
| 1|anurag|null|
| 5| null|null|
+---+------+----+
Or, for more than one space, try this:
scala> val someDFWithName = Seq(("n", "a"), ( "", "n"), (" ", ""), (" ", "a"), (" ",""), (" "," "), ("c"," ")).toDF("name", "place")
someDFWithName: org.apache.spark.sql.DataFrame = [name: string, place: string]
scala> someDFWithName.withColumn("Name", when(regexp_replace(col("name"),"\\s+","") === "", null).otherwise(col("Name"))).withColumn("Place", when(regexp_replace(col("place"),"\\s+","") === "", null).otherwise(col("place"))).show
+----+-----+
|Name|Place|
+----+-----+
| n| a|
|null| n|
|null| null|
|null| a|
|null| null|
|null| null|
| c| null|
+----+-----+
I hope this will help you. Thanks