How to call print from otherwise in Spark Scala?

I have a dataframe where I'm checking a condition and performing an operation accordingly. I'm using the "when and otherwise" functions and trying to print the failed rows in "otherwise", but nothing is printed. Any help will be appreciated.
joinDF=
+--------+-------------+----------+--------+-------+
| ID | A | B | C | D |
+--------+-------------+----------+--------+-------+
| 9574| F| 005912| 2016022| 10|
| 9576| F| 005912| 2016022| 21|
| 9578| F| 005912| 2016022| 0|
| 9580| F| 005912| 2016022| 19|
| 9582| F| 005912| 2016022| 89|
+--------+-------------+----------+--------+-------+
joinDF
.withColumn("Validate", when(joinDF("D") =!= 0, lit(true)).otherwise(print(joinDF("ID"))))

It's quite simple and very straightforward:
Seq(("A",1),("B",0)).toDF("key","value")
.withColumn("verdict", when($"value"=!=0, lit("true")).otherwise("false")).show
+---+-----+-------+
|key|value|verdict|
+---+-----+-------+
| A| 1| true|
| B| 0| false|
+---+-----+-------+
You don't need if, else, or UDFs.
With your example:
Seq(("A",1),("B",0)).toDF("ID","D").withColumn("validate",when($"D" =!= 0 ,lit("true")).otherwise($"ID")).show
+---+---+--------+
| ID| D|validate|
+---+---+--------+
| A| 1| true|
| B| 0| B|
+---+---+--------+

If you are trying to print the value when it does not match, you can create a UDF as below:
val validate = udf((value: String) => {
  // if value is not equal to "0", return "true"; else print the value and return it
  if (value != "0") "true"
  else {
    print(s"value = ${value}")
    value
  }
})

// this adds a new column named Validate by calling the validate udf
joinDF.withColumn("Validate", validate($"ID"))
Hope this helps!
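Note that print inside otherwise will not print once per row: the expression is evaluated on the driver while the Column is being built, so the side effect never runs per record. If the goal is just to see which IDs failed the check, a more reliable approach is to filter the failing rows and collect them explicitly. A minimal Scala sketch, assuming the dataframe is named joinDF as above:
// rows that fail the check (D == 0)
val failed = joinDF.filter(joinDF("D") === 0)
// bring the failing IDs back to the driver and print them
// (fine for small result sets; avoid collect on very large data)
failed.select("ID").collect().foreach(row => println(s"failed ID = ${row.get(0)}"))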

Related

how to sequentially iterate rows in Pyspark Dataframe

I have a Spark DataFrame like this:
+-------+------+-----+---------------+
|Account|nature|value| time|
+-------+------+-----+---------------+
| a| 1| 50|10:05:37:293084|
| a| 1| 50|10:06:46:806510|
| a| 0| 50|11:19:42:951479|
| a| 1| 40|19:14:50:479055|
| a| 0| 50|16:56:17:251624|
| a| 1| 40|16:33:12:133861|
| a| 1| 20|17:33:01:385710|
| b| 0| 30|12:54:49:483725|
| b| 0| 40|19:23:25:845489|
| b| 1| 30|10:58:02:276576|
| b| 1| 40|12:18:27:161290|
| b| 0| 50|12:01:50:698592|
| b| 0| 50|08:45:53:894441|
| b| 0| 40|17:36:55:827330|
| b| 1| 50|17:18:41:728486|
+-------+------+-----+---------------+
I want to compare the nature column of one row to the other rows with the same Account and value, looking forward, and add a new column named Repeated. The new column gets true for both rows if nature changed, from 1 to 0 or vice versa. For example, the above dataframe should look like this:
+-------+------+-----+---------------+--------+
|Account|nature|value| time|Repeated|
+-------+------+-----+---------------+--------+
| a| 1| 50|10:05:37:293084| true |
| a| 1| 50|10:06:46:806510| true|
| a| 0| 50|11:19:42:951479| true |
| a| 0| 50|16:56:17:251624| true |
| b| 0| 50|08:45:53:894441| true |
| b| 0| 50|12:01:50:698592| false|
| b| 1| 50|17:18:41:728486| true |
| a| 1| 40|16:33:12:133861| false|
| a| 1| 40|19:14:50:479055| false|
| b| 1| 40|12:18:27:161290| true|
| b| 0| 40|17:36:55:827330| true |
| b| 0| 40|19:23:25:845489| false|
| b| 1| 30|10:58:02:276576| true|
| b| 0| 30|12:54:49:483725| true |
| a| 1| 20|17:33:01:385710| false|
+-------+------+-----+---------------+--------+
My solution is to group by or window over the Account and value columns; then, in each group, compare the nature of each row to the nature of the other rows and, as a result of the comparison, fill in the Repeated column.
I did this calculation with Spark Window functions. Like this:
windowSpec = Window.partitionBy("Account","value").orderBy("time")
df.withColumn("Repeated", coalesce(f.when(lead(df['nature']).over(windowSpec)!=df['nature'],lit(True)).otherwise(False))).show()
The result was like this which is not the result that I wanted:
+-------+------+-----+---------------+--------+
|Account|nature|value| time|Repeated|
+-------+------+-----+---------------+--------+
| a| 1| 50|10:05:37:293084| false|
| a| 1| 50|10:06:46:806510| true|
| a| 0| 50|11:19:42:951479| false|
| a| 0| 50|16:56:17:251624| false|
| b| 0| 50|08:45:53:894441| false|
| b| 0| 50|12:01:50:698592| true|
| b| 1| 50|17:18:41:728486| false|
| a| 1| 40|16:33:12:133861| false|
| a| 1| 40|19:14:50:479055| false|
| b| 1| 40|12:18:27:161290| true|
| b| 0| 40|17:36:55:827330| false|
| b| 0| 40|19:23:25:845489| false|
| b| 1| 30|10:58:02:276576| true|
| b| 0| 30|12:54:49:483725| false|
| a| 1| 20|17:33:01:385710| false|
+-------+------+-----+---------------+--------+
UPDATE:
To explain more, suppose the first Spark DataFrame is named "df". In the following, I write what exactly I want to do in each group of "Account" and "value":
a = df.withColumn('repeated', lit(False))
for i in range(len(group)):
    for j in range(i + 1, len(group)):
        if a.loc[i, 'nature'] != a.loc[j, 'nature'] and a.loc[j, 'repeated'] == False:
            a.loc[i, 'repeated'] = True
            a.loc[j, 'repeated'] = True
Would you please guide me how to do that using Pyspark Window?
Any help is really appreciated.
You actually need to guarantee that the order you see in your dataframe is the actual order. Can you do that? You need a column that sequences the rows, so that what happened did happen in that order. Inserting new data into a dataframe doesn't guarantee its order.
A window and lag (or lead) will allow you to look at the previous (or next) row's value and make the required adjustment.
FYI: I use coalesce here because, for the first row, there is no value to compare with. Consider using the second parameter to coalesce as you see fit for what should happen with the first value in the account.
If you need it, look at the monotonically_increasing_id function. It may help you create the ordering column that is required for us to look at this data deterministically.
from pyspark.sql.functions import lag, lead
from pyspark.sql.functions import lit
from pyspark.sql.functions import coalesce
from pyspark.sql.window import Window
spark.sql("create table nature (Account string,nature int, value int, order int)");
spark.sql("insert into nature values ('a', 1, 50,1), ('a', 1, 40,2),('a',0,50,3),('b',0,30,4),('b',0,40,5),('b',1,30,6),('b',1,40,7)")
windowSpec = Window.partitionBy("Account").orderBy("order")
nature = spark.table("nature");
nature.withColumn("Repeated", coalesce( lead(nature['nature']).over(windowSpec) != nature['nature'], lit(True)) ).show()
+-------+------+-----+-----+--------+
|Account|nature|value|order|Repeated|
+-------+------+-----+-----+--------+
| b| 0| 30| 4| false|
| b| 0| 40| 5| true|
| b| 1| 30| 6| false|
| b| 1| 40| 7| true|
| a| 1| 50| 1| false|
| a| 1| 40| 2| true|
| a| 0| 50| 3| true|
+-------+------+-----+-----+--------+
EDIT:
It's not clear from your description whether I should look forward or backward. I have changed my code to look forward a row, as this is consistent with account 'B' in your output. However, it doesn't seem like the logic for Account 'A' is identical to the logic for 'B' in your sample output. (Or I don't understand a subtlety of starting on '1' instead of starting on '0'.) If you want to look forward a row use lead; if you want to look back a row use lag.
Problem solved.
Even though this way costs a lot, it's OK.
def check(part):
    df = part
    size = len(df)
    for i in range(size):
        if df.loc[i, 'repeated'] == True:
            continue
        else:
            for j in range(i + 1, size):
                if (df.loc[i, 'nature'] != df.loc[j, 'nature']) & (df.loc[j, 'repeated'] == False):
                    df.loc[j, 'repeated'] = True
                    df.loc[i, 'repeated'] = True
                    break
    return df
df.groupby("Account","value").applyInPandas(check, schema="Account string, nature int,value long,time string,repeated boolean").show()
Update1:
Another solution without any iterations.
def check(df):
    df = df.sort_values('verified_time')
    df['index'] = df.index
    df['IS_REPEATED'] = 0
    df1 = df.sort_values(['nature'], ascending=[True]).reset_index(drop=True)
    df2 = df.sort_values(['nature'], ascending=[False]).reset_index(drop=True)
    df1['IS_REPEATED'] = df1['nature'] ^ df2['nature']
    df3 = df1.sort_values(['index'], ascending=[True])
    df = df3.drop(['index'], axis=1)
    return df
df = df.groupby("account", "value").applyInPandas(gf.check2,schema=gf.get_schema('trx'))
UPDATE2:
Solution with Spark window:
def is_repeated_feature(df):
    windowPartition = Window.partitionBy("account", "value", 'nature').orderBy('nature')
    df_1 = df.withColumn('rank', F.row_number().over(windowPartition))
    w = (Window
         .partitionBy('account', 'value')
         .orderBy('nature')
         .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
    df_1 = df_1.withColumn("count_nature", F.count('nature').over(w))
    df_1 = df_1.withColumn('sum_nature', F.sum('nature').over(w))
    df_1 = df_1.select('*')
    df_2 = df_1.withColumn('min_val',
                           when((df_1.sum_nature > (df_1.count_nature - df_1.sum_nature)),
                                (df_1.count_nature - df_1.sum_nature)).otherwise(df_1.sum_nature))
    df_2 = df_2.withColumn('more_than_one', when(df_2.count_nature > 1, '1').otherwise('0'))
    df_2 = df_2.withColumn('is_repeated',
                           when(((df_2.more_than_one == 1) & (df_2.count_nature > df_2.sum_nature) &
                                 (df_2.rank <= df_2.min_val)), '1')
                           .otherwise('0'))
    return df_2

How to combine dataframes with no common columns?

I have 2 data frames
val df1 = Seq(("1","2","3"),("4","5","6")).toDF("A","B","C")
df1.show
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
+---+---+---+
and
val df2 = Seq(("11","22","33"),("44","55","66")).toDF("D","E","F")
df2.show
+---+---+---+
| D| E| F|
+---+---+---+
| 11| 22| 33|
| 44| 55| 66|
+---+---+---+
I need to combine the ones above to get
val df3 = Seq(("1","2","3","","",""),("4","5","6","","",""),("","","","11","22","33"),("","","","44","55","66"))
.toDF("A","B","C","D","E","F")
df3.show
+---+---+---+---+---+---+
| A| B| C| D| E| F|
+---+---+---+---+---+---+
| 1| 2| 3| | | |
| 4| 5| 6| | | |
| | | | 11| 22| 33|
| | | | 44| 55| 66|
+---+---+---+---+---+---+
Right now I'm creating the missing columns for all dataframes manually to get to a common structure and then using a union. This code is specific to the dataframes and is not scalable.
I'm looking for a solution that will work with x dataframes with y columns each.
You can manually create missing columns in the two data frames and then union them:
import org.apache.spark.sql.DataFrame
val allCols = df1.columns.toSet.union(df2.columns.toSet).toArray
val createMissingCols = (df: DataFrame, allCols: Array[String]) => allCols.foldLeft(df)(
(_df, _col) => if (_df.columns.contains(_col)) _df else _df.withColumn(_col, lit(""))
).select(allCols.head, allCols.tail: _*)
// select is needed to make sure the two data frames have the same order of columns
createMissingCols(df1, allCols).union(createMissingCols(df2, allCols)).show
+---+---+---+---+---+---+
| E| F| A| B| C| D|
+---+---+---+---+---+---+
| | | 1| 2| 3| |
| | | 4| 5| 6| |
| 22| 33| | | | 11|
| 55| 66| | | | 44|
+---+---+---+---+---+---+
A much simpler way of doing this is creating a full outer join and setting the join expression/condition to false:
val df1 = Seq(("1","2","3"),("4","5","6")).toDF("A","B","C")
val df2 = Seq(("11","22","33"),("44","55","66")).toDF("D","E","F")
val joined = df1.join(df2, lit(false), "full")
joined.show()
+----+----+----+----+----+----+
| A| B| C| D| E| F|
+----+----+----+----+----+----+
| 1| 2| 3|null|null|null|
| 4| 5| 6|null|null|null|
|null|null|null| 11| 22| 33|
|null|null|null| 44| 55| 66|
+----+----+----+----+----+----+
If you then want to actually set the null values to an empty string, you can just add:
val withEmptyString = joined.na.fill("")
withEmptyString.show()
+---+---+---+---+---+---+
| A| B| C| D| E| F|
+---+---+---+---+---+---+
| 1| 2| 3| | | |
| 4| 5| 6| | | |
| | | | 11| 22| 33|
| | | | 44| 55| 66|
+---+---+---+---+---+---+
So, in summary, df1.join(df2, lit(false), "full").na.fill("") should do the trick.
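If you are on Spark 3.1 or later (an assumption about your environment), unionByName with allowMissingColumns = true gets you there without building the missing columns yourself; the absent columns are filled with nulls, which you can again replace with empty strings:
// Spark 3.1+: union by column name, creating missing columns as null
val combined = df1.unionByName(df2, allowMissingColumns = true)
// optionally turn the nulls into empty strings, as above
combined.na.fill("").show()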

Check if any of the dataframe columns are empty

I need to check if any of the columns of the dataframe are empty. Empty is defined as all rows of the column having values that are either null or an empty string.
The dataframe is as follows:
+---+-------+-------+-------+-------+
| ID|Sample1|Sample2|Sample3|Sample4|
+---+-------+-------+-------+-------+
| 1| a1| b1| c1| null|
| 2| | | | |
| 3| a3| | | |
+---+-------+-------+-------+-------+
The code I use for the check is:
mainDF.select(mainDF.columns.map(c => sum((col(c).isNotNull && col(c)!="").cast("int")).alias(c)): _*).show()
What I get is
+---+-------+-------+-------+-------+
| ID|Sample1|Sample2|Sample3|Sample4|
+---+-------+-------+-------+-------+
| 3| 3| 3| 3| 2|
+---+-------+-------+-------+-------+
What I hope to get is
+---+-------+-------+-------+-------+
| ID|Sample1|Sample2|Sample3|Sample4|
+---+-------+-------+-------+-------+
| 3| 2| 1| 1| 0|
+---+-------+-------+-------+-------+
Also, my final result should be true or false indicating whether any of the columns is empty.
In this case it will be true because the count for Sample4 is 0.
You could map valid values to 1 and empty strings/nulls to 0 and then perform a sum.
mainDF.select(mainDF.columns.map(c => sum(when(col(c)==="" or col(c).isNull,0).otherwise(1)).as(c)):_*).show()
+---+-------+-------+-------+-------+
| ID|Sample1|Sample2|Sample3|Sample4|
+---+-------+-------+-------+-------+
| 3| 2| 1| 1| 0|
+---+-------+-------+-------+-------+
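To get the final true/false answer from those per-column counts, one option (a minimal sketch, assuming the single aggregated row is cheap to collect) is to pull the result row to the driver and check whether any count is zero:
val countsRow = mainDF
  .select(mainDF.columns.map(c => sum(when(col(c) === "" or col(c).isNull, 0).otherwise(1)).as(c)): _*)
  .first()
// true if at least one column has no non-empty values
val anyColumnEmpty = countsRow.toSeq.exists(_ == 0L)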

spark aggregation count on condition

I'm trying to group a data frame, and when aggregating rows with a count, I want to apply a condition on the rows before counting.
Here is an example:
val test=Seq(("A","X"),("A","X"),("B","O"),("B","O"),("c","O"),("c","X"),("d","X"),("d","O")).toDF
test.show
+---+---+
| _1| _2|
+---+---+
| A| X|
| A| X|
| B| O|
| B| O|
| c| O|
| c| X|
| d| X|
| d| O|
+---+---+
In this example I want to group by column _1 and count column _2 when the value = 'X'.
Here is the expected result:
+---+-----------+
| _1| count(_2) |
+---+-----------+
| A| 2 |
| B| 0 |
| c| 1 |
| d| 1 |
+---+-----------+
Use when to get this aggregation. A PySpark solution is shown here:
from pyspark.sql.functions import col, count, when
test.groupBy("_1").agg(count(when(col("_2") == 'X', 1))).show()
import spark.implicits._
val test=Seq(("A","X"),("A","X"),("B","O"),("B","O"),("c","O"),("c","X"),("d","X"),("d","O")).toDF
test.groupBy("_1").agg(count(when($"_2"==="X", 1)).as("count")).orderBy("_1").show
+---+-----+
| _1|count|
+---+-----+
| A| 2|
| B| 0|
| c| 1|
| d| 1|
+---+-----+
As an alternative, in Scala, it can be:
val counter1 = test.select( col("_1"),
when(col("_2") === lit("X"), lit(1)).otherwise(lit(0)).as("_2"))
val agg1 = counter1.groupBy("_1").agg(sum("_2")).orderBy("_1")
agg1.show
which gives the result:
+---+-------+
| _1|sum(_2)|
+---+-------+
| A| 2|
| B| 0|
| c| 1|
| d| 1|
+---+-------+
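If you are on Spark 3.0 or later (an assumption about your version), the count_if SQL function expresses the same aggregation quite directly:
// expr comes from org.apache.spark.sql.functions
test.groupBy("_1")
  .agg(expr("count_if(_2 = 'X')").as("count"))
  .orderBy("_1")
  .show()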

Update column value from another columns based on multiple conditions in spark structured streaming

I want to update the value in one column using another two columns based on multiple conditions. For eg - the stream is like :
+---+---+----+---+
| A | B | C | D |
+---+---+----+---+
| a | T | 10 | 0 |
| a | T | 100| 0 |
| a | L | 0 | 0 |
| a | L | 1 | 0 |
+---+---+----+---+
What I have is multiple conditions like -
(B = "T" && C > 20 ) OR (B = "L" && C = 0)
The values "T", 20, "L" and 0 are dynamic, and the AND/OR operators are also supplied at run-time. I want to make D = 1 whenever the condition holds true; otherwise it should remain D = 0. The number of conditions is also dynamic.
I tried using the UPDATE command in spark-sql, i.e. UPDATE df SET D = '1' WHERE CONDITIONS, but it says that UPDATE is not yet supported. The resulting dataframe should be:
+---+---+----+---+
| A | B | C | D |
+---+---+----+---+
| a | T | 10 | 0 |
| a | T | 100| 1 |
| a | L | 0 | 1 |
| a | L | 1 | 0 |
+---+---+----+---+
Is there any way I can achieve this?
I hope you are using Python. I will post the same for Scala as well! Use a udf:
PYTHON
>>> df.show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| a| T| 10| 0|
| a| T|100| 0|
| a| L| 0| 0|
| a| L| 1| 0|
+---+---+---+---+
>>> from pyspark.sql.functions import udf
>>> def get_column(B, C):
...     return int((B == "T" and C > 20) or (B == "L" and C == 0))
...
>>> fun = udf(get_column)
>>> res = df.withColumn("D", fun(df['B'], df['C']))
>>> res.show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| a| T| 10| 0|
| a| T|100| 1|
| a| L| 0| 1|
| a| L| 1| 0|
+---+---+---+---+
SCALA
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> df.show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| a| T| 10| 0|
| a| T|100| 0|
| a| L| 0| 0|
| a| L| 1| 0|
+---+---+---+---+
scala> def get_column(B : String, C : Int) : Int = {
| if((B == "T" && C > 20) || (B == "L" && C == 0))
| 1
| else
| 0
| }
get_column: (B: String, C: Int)Int
scala> val fun = udf(get_column _)
fun: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,Some(List(StringType, IntegerType)))
scala> val res = df.withColumn("D", fun(df("B"), df("C")))
res: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more fields]
scala> res.show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| a| T| 10| 0|
| a| T|100| 1|
| a| L| 0| 1|
| a| L| 1| 0|
+---+---+---+---+
You can also use case when and otherwise like this:
PYTHON
>>> df.show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| a| T| 10| 0|
| a| T|100| 0|
| a| L| 0| 0|
| a| L| 1| 0|
+---+---+---+---+
>>> from pyspark.sql.functions import when, col
>>> new_column = when(
...     (col("B") == "T") & (col("C") > 20), 1
... ).when((col("B") == "L") & (col("C") == 0), 1).otherwise(0)
>>> res = df.withColumn("D", new_column)
>>> res.show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| a| T| 10| 0|
| a| T|100| 1|
| a| L| 0| 1|
| a| L| 1| 0|
+---+---+---+---+
SCALA
scala> df.show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| a| T| 10| 0|
| a| T|100| 0|
| a| L| 0| 0|
| a| L| 1| 0|
+---+---+---+---+
scala> val new_column = when(
| col("B") === "T" && col("C") > 20, 1
| ).when(col("B") === "L" && col("C") === 0, 1 ).otherwise(0)
new_column: org.apache.spark.sql.Column = CASE WHEN ((B = T) AND (C > 20)) THEN 1 WHEN ((B = L) AND (C = 0)) THEN 1 ELSE 0 END
scala> df.withColumn("D", new_column).show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| a| T| 10| 0|
| a| T|100| 1|
| a| L| 0| 1|
| a| L| 1| 0|
+---+---+---+---+
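Since the values, the operators, and the number of conditions are only known at run-time, you can also assemble the predicate dynamically before handing it to when. A minimal Scala sketch, assuming each condition arrives as a pre-built Column and that they are combined with OR (adjust the fold for AND or mixed operators):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
// conditions assembled at run-time from the dynamic inputs
val conditions: Seq[Column] = Seq(
  col("B") === "T" && col("C") > 20,
  col("B") === "L" && col("C") === 0
)
// OR them together into a single predicate; lit(false) covers an empty list
val combined = conditions.foldLeft(lit(false))(_ || _)
val res = df.withColumn("D", when(combined, 1).otherwise(0))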